Skip to content

[#33] Phase A: download X-post photos and render in notes#35

Merged
VGonPa merged 18 commits into
developfrom
ws-33-images-phase-a
May 24, 2026
Merged

[#33] Phase A: download X-post photos and render in notes#35
VGonPa merged 18 commits into
developfrom
ws-33-images-phase-a

Conversation

@VGonPa
Copy link
Copy Markdown
Owner

@VGonPa VGonPa commented May 24, 2026

Closes #33.

Summary

Phase A of the media subsystem: xbrain media downloads X-post photos to disk, the data model tracks per-photo state via a tagged union, and xbrain generate renders downloaded photos inline in Obsidian notes. Videos remain in a video_pending state for a future phase.

  • New command xbrain media [--force] [--limit N] [--items a,b,c] walks every photo URL in Item.media, downloads bytes from pbs.twimg.com with a size cascade (name=origlargemedium), validates with Pillow, and atomically writes the file under data/media/<item-id>/<index>.<ext>. Auto-snapshots data/ first (label pre-media).
  • Tagged union model replaces the flat Media(type, url): MediaPhotoPending / MediaPhotoDownloaded / MediaPhotoFailed / MediaVideoPending, discriminated by kind. Mirrors the ContentSource design from Refactor: FetchResult / ContentSource as tagged unions #20 — illegal states unrepresentable. A _normalise_legacy_media BeforeValidator promotes the pre-Phase-A {type, url} shape on read; no manual migration.
  • Failure categorisation mirrors xbrain.fetch (fetch: retry transiently-failed items without --force #19): transient bucket (http_5xx, timeout, unknown_error) auto-retried on the next run; permanent bucket (http_4xx, format_error) only with --force. Total-failure raises RuntimeError (mirror of Narrow the broad blocks in ApiExecutor + topic_synth #24).
  • Ctrl-C safety: orchestrator calls a on_progress callback after every photo transition; CLI persists items.json atomically between photos.
  • Rendering: xbrain generate mirrors photos from data/media/ into <output_dir>/_media/ at render time and embeds them inline (![[_media/<id>/<n>.<ext>]]) so the vault is fully self-contained.
  • xbrain diff extension: new MediaDiff block reports per-variant counts on both sides + the four state-transition deltas (+N/-N).

Acceptance criteria

Verified locally (the full-corpus end-to-end run is reserved for the reviewer's machine per the task brief):

  • Running xbrain media on the corpus completes without unhandled exceptions (covered by tests/test_media.py — happy path, all five failure categories, total-failure raise, partial-failure no-raise).
  • Every photo URL ends in a defined state after a run (the type system enforces it; _record_outcome is exhaustive over the post-transition variants).
  • Photos land on disk in a deterministic path (data/media/<id>/<n>.<ext>).
  • Re-running xbrain media is a no-op for already-downloaded photos (test_download_all_idempotent_for_already_downloaded).
  • --force re-downloads everything (test_download_all_force_redownloads_already_downloaded).
  • xbrain generate produces notes with photos visible (test_generate_renders_downloaded_photo_as_obsidian_embed, mirror-into-vault verified).
  • xbrain diff reports new-download counts (test_diff_reports_media_state_counts).
  • Ctrl-C leaves items.json valid (test_download_all_propagates_keyboard_interrupt + the CLI's atomic per-photo persistence).
  • Existing data loads without manual migration (test_media_legacy_photo_shape_migrates_to_pending, test_item_with_legacy_media_loads_into_pending_variant).
  • A snapshot is created automatically before the run (test_media_creates_pre_snapshot).
  • Reserved for reviewer end-to-end: ≥95% downloaded after a full-corpus run, manual spot-check of 10 random notes, total disk usage ≤350 MB.

Deviations from the implementation plan

  1. Embed path: the plan specified ![[data/media/<id>/<n>.<ext>]] and "verify in the end-to-end test" that Obsidian could resolve the relative path. That only works when the vault root and data/ align, which is not the common case (Víctor's vault is ~/Documents/Vault/vault/, the repo is ~/devel/xbrain/). Instead, xbrain generate mirrors each MediaPhotoDownloaded from data/media/ into <output_dir>/_media/ at render time and embeds ![[_media/<id>/<n>.<ext>]]. The canonical store stays at data/media/; the vault copy is a render-time artifact. Documented in README and ARCHITECTURE.
  2. Media as a factory function: the constraint forbade modifying extract/graphql.py and archive.py "except possibly to import the new media types". Rather than rewriting their call sites, Media(type=..., url=...) is preserved as a factory function in models.py that returns MediaPhotoPending or MediaVideoPending. Existing call sites work unchanged. I did add minimal type annotation updates (list[Media]list[MediaEntry]) in those two files because mypy needs the type alias for the new union; the runtime calls are untouched.
  3. requests declared explicitly in pyproject.toml: was previously a transitive dep of trafilatura. Deptry flagged DEP003 for the new direct import in media.py; declaring it explicitly is the right hygiene.
  4. _has_note extended to include items with media: a photo-only tweet was previously invisible in the wiki. The plan didn't call this out, but xbrain generate needs to render notes for photo-only items now that photos have a render block. Pinned by test_generate_media_only_item_gets_a_note.

Open questions / known limitations

  • Disk budget unverified at scale. The 350 MB target is a guess based on average photo sizes; the actual number will land after the reviewer's full-corpus run.
  • Concurrency: the plan suggested a ThreadPoolExecutor(max_workers=4) parallelism cap. The current implementation is sequential with a 0.5 s throttle between requests. Concurrency was deferred to keep the Ctrl-C-coherency contract simple — adding workers safely requires a lock around the on_progress writes. If the full-corpus run is too slow, a follow-up PR can add bounded parallelism.
  • Snapshot does not include data/media/: xbrain snapshot still only copies the JSON artifacts. A snapshot restore of a pre-media snapshot will revert items.json but leave the bytes on disk (the variants still pointing at them, but the timestamp drift is observable). Fine for Phase A: Download X-post photos and render in Obsidian notes #33; a follow-up could decide whether data/media/ should be part of the snapshot envelope.

Test plan

  • uv run pytest -q — should report 412 passed (358 baseline + 54 new).
  • uv run bash scripts/check.sh — all critical checks must pass; coverage 90%.
  • uv run xbrain media --help — confirm the three flags are present.
  • uv run xbrain media on the real corpus (data/items.json, ~1884 items) — reviewer's machine. Verify:
    • ≥95% of photo URLs end in MediaPhotoDownloaded.
    • Total disk usage on data/media/ ≤ 350 MB.
    • xbrain generate then a manual spot-check of 10 random notes — every photo renders inline in Obsidian.
    • xbrain diff <pre-media-snapshot> reports the new-download counts on the MEDIA block.
    • Re-running xbrain media reports every previously-downloaded photo on the SUMMARY's skipped count and creates zero new files.

🤖 Generated with Claude Code

VGonPa added 15 commits May 24, 2026 13:42
Replaces the flat `Media(type, url)` model with a four-variant tagged
union — `MediaPhotoPending` / `MediaPhotoDownloaded` / `MediaPhotoFailed`
/ `MediaVideoPending` — discriminated by `kind`. Mirrors the
`ContentSource` design from #20: illegal states (e.g. "downloaded"
without `local_path`) are unrepresentable.

A `_normalise_legacy_media` BeforeValidator promotes the pre-Phase-A
`{type, url}` shape on read so existing `data/items.json` files load
without a manual migration. A `Media(...)` factory function preserves
the constructor signature used by `extract/graphql.py` and `archive.py`
— those modules are out of scope for #33 and keep working unchanged.

`MediaFailureReason` adds the categorised failure bucket
(http_4xx / http_5xx / timeout / format_error / unknown_error) the
downloader will fill in. Transient vs permanent semantics mirror #19.

Tests cover legacy migration (photo + video), pass-through of the new
shape, the discriminator rejecting unknown kinds, required-field
enforcement on the failure variant, the factory routing, and a full
`Item` round-trip through the legacy media shape.

Refs #33
`xbrain.media.download_all` walks every photo entry across the store,
downloads bytes from pbs.twimg.com with a size cascade
(`name=orig` → `name=large` → `name=medium`), validates the bytes
with Pillow, and writes the file atomically (tmp + rename) under
`data/media/<item-id>/<index>.<ext>`.

Failure handling mirrors `xbrain.fetch` (#19/#20):
- 4xx → `http_4xx` (permanent)
- 5xx → `http_5xx` (transient — auto-retried next run)
- `requests.Timeout` → `timeout` (transient)
- Connection errors / unknown exceptions → `unknown_error` (transient)
- Bytes that Pillow can't decode → `format_error` (permanent)

`KeyboardInterrupt` propagates untouched (narrow except clauses). A
total failure (all attempts failed) raises `RuntimeError`, mirroring
`ApiExecutor.enrich_items` (#24) — the CLI's `_handle_cli_errors`
turns this into a clean operator message + exit 1.

Idempotency: a re-run is a no-op for `MediaPhotoDownloaded`. A
progress callback fires after each photo transition, so the CLI
can persist `items.json` between photos — Ctrl-C mid-batch leaves
a coherent store.

I/O is dependency-injected (`session`, `sleep`, `on_progress`) so
tests run offline with a hand-rolled `FakeSession` — no
`requests-mock` dependency in the unit path.

Tests cover: 200-on-orig success, cascade fallback, idempotency,
all five failure buckets, `--force`, `--limit`, `--items` filter,
throttling, progress callback, video-pending no-op, Ctrl-C
propagation, total-failure raise, partial-failure no-raise,
atomic write (no `.part` residue), summary-line emission silence
vs. full counters.

Adds Pillow as a runtime dep (`uv add pillow`).

Refs #33
The `media` command wires the Phase A downloader into the CLI:
1. Auto-snapshots `data/` first via `_auto_snapshot(cfg, "media")`,
   reusing the #17 destructive-op recovery boundary.
2. Loads `items.json`, runs `xbrain.media.download_all` with a
   per-photo `on_progress` callback that persists the store
   atomically — Ctrl-C mid-run leaves a coherent `items.json`.
3. Persists once more in a finally block so the partial state
   from a `RuntimeError`-on-total-failure is not lost.
4. Emits the SUMMARY line on stderr (mirror of #24) and an
   operator-friendly recap on stdout.

Flags: `--force` (re-download already-downloaded photos),
`--limit N` (cap attempts), `--items <a,b,c>` (filter by id).

`Config.media_dir` resolves to `data/media/` — under the existing
gitignored `data/*` umbrella, no `.gitignore` change needed.

Tests cover the snapshot-before-run invariant (mirrors the existing
fetch/vocab/topics auto-snapshot tests), the full happy path
through a fake `requests.Session` (no real network), the
`--items` filter, and the total-failure → exit-1 contract with
the partial-state-persisted invariant verified.

Refs #33
`xbrain generate` now handles every media variant introduced by #33:

- `MediaPhotoDownloaded` → embedded inline as `![[_media/<id>/<n>.<ext>]]`
  after the tweet body, before the `## Enlaces` section. Files are
  mirrored from `data/media/` into `<output_dir>/_media/` at render
  time so the vault is fully self-contained — no Obsidian config or
  symlinks needed on the user side.
- `MediaPhotoFailed` → one-line ⚠ warning carrying the original URL
  and the failure reason (translated to user-facing Spanish via
  `_FAILURE_ES_MEDIA`).
- `MediaPhotoPending` → silent. Not an error, just "not yet processed".
- `MediaVideoPending` → 🎥 placeholder with the URL. Phase A does not
  download videos; the URL is the only evidence we surface.

`_has_note` extended: an item with only media (no link, no enrichment)
is now note-worthy. A photo-only tweet was previously invisible in the
wiki — now it gets its own page as soon as `xbrain media` advances
the variant.

Render order in the Tweet section:
  tweet text → photos → links → external article content

Deviation from the implementation plan: the plan called for embeds
pointing at `data/media/<id>/<n>.<ext>` and "verifying in the
end-to-end test" that Obsidian could resolve the relative path. That
only works when the vault root and `data/` align, which is not the
common case. Mirroring the photos into `<output_dir>/_media/` makes
the embed resolve unconditionally for ~zero extra disk (the canonical
copy stays at `data/media/`; the vault copy is a render-time
artifact). Documented in README.

Tests cover all four render branches, the media-block-before-Enlaces
ordering, multi-photo inline rendering, missing-bytes recovery (logs
+ keeps embed), and backward-compat when `generate(media_root=None)`.

Refs #33
`DiffReport` now carries a `media: MediaDiff` block with per-variant
counts on both sides (A/B) plus the four deltas
(`delta_downloaded`, `delta_pending`, `delta_failed`,
`delta_video_pending`). `DiffSummary` surfaces the two highest-signal
counters (`media_delta_downloaded`, `media_delta_failed`) at the top
level for quick-glance ops review.

The text formatter adds a `MEDIA` block with right-aligned
A/B columns and `±N` deltas. The JSON formatter is automatic via
pydantic — `media` becomes a fifth top-level key. The CLI tests that
asserted the top-level key set have been updated.

Negative deltas are legal — `xbrain media --force` can move a
previously-downloaded photo back to `failed` if the URL turned 404
in the meantime. `MediaStateCounts` carries the absolute counts
unsigned; the deltas carry sign.

Refs #33
README:
- Add `media` row to the Commands table.
- New "Local media storage" section covering on-disk layout
  (`data/media/<id>/<n>.<ext>`), vault rendering convention
  (mirrored into `<output_subdir>/_media/` for self-contained
  embeds), disk budget guidance, throttling, and the
  transient/permanent failure-retry contract.

ARCHITECTURE:
- New "media" entry under per-stage detail (state machine,
  Ctrl-C safety, snapshot trigger).
- Artifacts table row for `media/<id>/<n>.<ext>`.
- Invariant #10: media variants are mutually exclusive states;
  state transitions only via `xbrain media`; legacy `{type, url}`
  records migrate on read.

Refs #33
No behaviour change. ruff format normalises trailing commas, line
breaks and trivia in the files touched by the Phase A series.

Refs #33
Split the 19-line orchestrator into three small helpers:

- `_build_session` extracts the new-vs-injected session decision so
  `download_all` does not carry a one-off branch.
- `_eligible_items` is a generator that filters by `items_filter`
  and skips items without media, so the orchestrator reads as a
  flat "for each eligible item, process it" loop.
- `_DownloadContext` is a frozen dataclass that bundles the inner-
  loop deps (session, timeouts, throttle, callbacks, force) so
  `_process_item` takes one context argument instead of seven.

`download_all` drops from C(19) → B(7); `_process_item` stays at
B(10); no behaviour change, all 412 tests still pass.

Refs #33
`xbrain.media` imports `requests` directly for the HTTP session.
Pulling it through trafilatura's transitive dep was a deptry DEP003
warning; declaring it explicitly is the right hygiene.

No behaviour change — the same `requests-2.34.2` was already
installed via trafilatura.

Refs #33
… wrap write

- Narrow `_decode_image` exception set to Pillow-specific
  (UnidentifiedImageError, DecompressionBombError, SyntaxError). Parent
  OSError swallowed real I/O bugs (disk/permission/network).
- Wire `logger.warning` per failure in `_record_outcome` — the
  total-failure RuntimeError now actually has the breadcrumbs it
  references.
- Wrap `_write_bytes` call in `_download_one` with `try/except OSError`,
  bucketing disk-full / permission / read-only-fs errors as
  `unknown_error` (transient) so a single bad write doesn't abort the
  whole batch.
- Sweep stale `*.part` files at the entry to `download_all` to clean up
  SIGKILL/OOM orphans from previous runs.
- Document persistence-failure semantics in `_run_media` docstring.
- Add `AVISO:` stderr line on `--items` filter that matches no item.
- Add `--verbose` flag printing failed (item_id, reason, url) at the end.
- Drop SessionProtocol (defined for one use); use `requests.Session | None`
  directly in `download_all`.
- Drop `_DownloadContext` dataclass, `_process_item` helper, and the
  mutable-list `remaining=[limit]` "box" trick. Inline the per-item loop
  into `download_all` with a plain `int | None` counter and
  `enumerate(item.media)` instead of `range(len)`.
- Inline `_build_session` (used once), `_eligible_items` (used once), and
  `_finalise` (one-line, called twice) into `download_all`.
- Drop `original` param from `_record_outcome` and `entry` param from
  `_failed` — both were unused and rationalised by speculative comments.
  Rewrite the now-truthful docstrings.
- Move `_REASON_SEVERITY` to module top alongside `_TRANSIENT_MEDIA_FAILURES`
  and `_SIZE_CASCADE` (was defined mid-file). Drop the unused
  `format_error: 5` row — `format_error` is an early-return path, never
  cascade-compared.
- Cap `_format_error` output at 500 chars to keep hostile CDN bodies out
  of `items.json`.
- Add `typing.assert_never` exhaustiveness at the three media-variant
  isinstance chains (media.py:_is_eligible, generate.py:_render_media_lines,
  diff.py:_count_media_variants) so Phase B catches missed variants
  statically.
- Trim media.py module docstring from 5 paragraphs to 2.
- `MediaPhotoDownloaded.width / height / bytes_size` now `Field(gt=0)`.
  A zero-pixel or zero-byte downloaded photo is semantically illegal;
  Pillow validation rules it out at the seam, the type pins it at the
  data layer.
- `MediaPhotoDownloaded.local_path` gets a pydantic validator that
  rejects absolute paths and `..` components. The downloader never
  builds such a path, but `items.json` is on-disk plain JSON the user
  can edit — defence in depth at the type boundary is cheap.
- `MediaPhotoFailed.attempts` drops its default and becomes
  `Field(ge=1)`. "Failed but never attempted" is nonsense; the
  downloader always increments before producing the variant.
- `_MediaPhotoBase` docstring corrected: `type` was never the
  discriminator, it's preserved for wire-compat with the legacy
  `{type, url}` shape.
- `Media()` factory annotated `# noqa: N802` (the rule is intentional)
  with a TODO marker to remove when the extractor/archive migrate to
  direct variant construction.
- Tests updated to pass `attempts=1` where they previously relied on
  the default.
- Strip `Phase A`, `Phase B`, `(#33)`, `pre-#33`, `pre-Phase-A` markers
  from new/modified code, docstrings, and tests. Also strip `(#17)`,
  `(#19)`, `(#20)`, `(#24)` references added in this PR's diff. The PR
  description carries the issue link; code should describe lasting
  invariants, not the PR that introduced them.
- Rewrite `_TRANSIENT_MEDIA_FAILURES` cross-reference comment: was
  `_TRANSIENT_FAILURES` (the actual symbol in fetch.py).
- Reword `Failed(transient)` "terminal-ish" phrasing in ARCHITECTURE.md
  `### media` — `Failed(transient)` IS auto-retried; not terminal.
- Reconcile snapshot/media docs: snapshots cover only the four JSON
  artifacts; photo bytes under `data/media/` are NOT snapshotted today.
  Updated `config.py:media_dir` docstring and ARCHITECTURE.md to make
  the carve-out explicit; re-downloading is the recovery path.
- Trim ARCHITECTURE.md invariant #10 to match the brief style of
  invariants #1-#9. Retry contract and storage layout moved to the
  `### media` section above (now with storage layout subsection).
- Trim the `media` row in README.md Commands table to one sentence with
  a link to `Local media storage`. Mark the disk-budget numbers as
  approximate.
…bose

- Split the old `test_download_all_records_http_4xx_failure_when_cascade_exhausted`
  into two: one asserts the bucket via a partial-success setup; the other
  asserts the lone-attempt RuntimeError. Mixing both concerns made the
  original test confusing.
- Add `test_download_all_falls_back_through_full_cascade_to_medium`: orig
  404 + large 404 + medium 200 — proves the third cascade step works.
- Add WebP and JPEG extension tests: `.webp` and `.jpg` URLs end up with
  the right extension on disk.
- Add 3xx (304) handling: bucketed as `unknown_error` (transient).
- Add atomic-write rollback test: simulate `Path.replace` raising; the
  `.part` file is cleaned up and no half-written final file remains.
- Add `_sweep_part_orphans` smoke test: a stale `*.part` left by a SIGKILL
  is removed on next `download_all` entry.
- Add partial-failure SUMMARY line test: a run with 0 downloads + N
  failures is NOT silent — ops needs the count.
- Add Pillow truncated-PNG test: `_png_bytes()[:32]` lands in the
  `format_error` bucket.
- Add `test_diff_media_reports_delta_failed`: diff surfaces a `delta_failed`
  count between snapshots (both on `media.delta_failed` and
  `summary.media_delta_failed`).
- Add `test_media_command_resume_after_interrupt_completes_remaining`:
  the full Ctrl-C-and-resume integration round.
- Add `test_media_command_warns_when_items_filter_matches_nothing`:
  AVISO line on a no-match `--items` filter.
- Add `test_media_command_verbose_lists_failed_urls`: `--verbose` prints
  `<item_id> <reason> <url>` per failure on stderr.
- Tighten `test_media_command_runs_on_empty_store`: assert exit-code 0
  AND items.json is byte-identical (no spurious rewrite).
- Fix `test_generate_renders_failed_photo_as_warning`: replace
  `"URL no encontrada" in body or "HTTP 4xx" in body` with the actual
  rendered string `"URL no encontrada (HTTP 4xx)"`.
- Add `_ = cls` line to `_reject_path_traversal` to silence vulture's
  unused-variable warning; pydantic's `@field_validator + @classmethod`
  contract requires the `cls` parameter even when the body does not
  reference it.
- Apply `ruff format` to `tests/test_cli.py`. No semantic changes.
@VGonPa
Copy link
Copy Markdown
Owner Author

VGonPa commented May 24, 2026

Fix round 1 — all reviewer flags addressed

6 commits pushed on top of the original PR. Tests grew from 412 → 425
(coverage on media.py 93% → 96%). poe check is clean on the
critical gates; the remaining warnings (radon C on download_all /
_download_one, deptry on browser_cookie3 in scripts) are explained
below.

Category 1 — Silent failure regressions (738390d)

  • B1 Narrowed _decode_image to (UnidentifiedImageError, Image.DecompressionBombError, SyntaxError). Real I/O bugs (FileNotFoundError, PermissionError, network) no longer get silently bucketed as format_error.
  • B2 Added _sweep_part_orphans(media_root) on download_all entry; documented in _write_bytes. SIGKILL/OOM *.part orphans are now reclaimed on next run.
  • H1 Wired logger.warning(item=, url=, reason=, error=) in _record_outcome on every failure — the total-failure RuntimeError's "see warnings above" now actually has breadcrumbs.
  • H2 Wrapped _write_bytes in _download_one with try/except OSError, bucketed as unknown_error (transient). Disk-full / read-only / permission errors no longer abort the whole batch.
  • H3 Documented persistence-failure semantics on _run_media docstring (disk full inside on_progress aborts the run; re-run picks up still-pending photos).
  • Added AVISO: stderr line on --items filter that matches no item; added --verbose flag printing <item_id> <reason> <url> per failure.

Category 2 — Simplification (e053be5)

  • Dropped SessionProtocol, _DownloadContext, _process_item, _build_session, _eligible_items, _finalise. Replaced the remaining=[limit] "box" with a plain int | None. The body of download_all is now a flat 35-line loop instead of three-layer indirection.
  • Removed unused params: original on _record_outcome, entry on _failed. Rewrote the now-truthful docstrings.
  • Moved _REASON_SEVERITY to module top alongside the other constants; dropped the unused format_error: 5 row.
  • Capped _format_error output at 500 chars (_MAX_ERROR_LEN) so a chatty CDN body can't bloat items.json.
  • Added typing.assert_never exhaustiveness at the three media-variant isinstance chains: media.py:_is_eligible, generate.py:_render_media_lines, diff.py:_count_media_variants. Phase B will catch missed variants at mypy time.
  • Trimmed media.py module docstring 5 → 2 paragraphs.
  • Trade-off: download_all radon grade went C (was B before the inlining). The reviewer explicitly asked for the inlining; the function reads cleanly as a single loop with two guards.

Category 3 — Type design hardening (a1a6aa7)

  • MediaPhotoDownloaded.width / height / bytes_sizeField(gt=0). Zero-pixel / zero-byte is illegal.
  • MediaPhotoDownloaded.local_path → pydantic validator rejects absolute paths and .. components. Items.json is on-disk plain JSON the user can edit; defence in depth.
  • MediaPhotoFailed.attemptsField(ge=1), no default. "Failed but never attempted" is nonsense.
  • _MediaPhotoBase docstring corrected — type was never the discriminator, it's wire-compat only.
  • Media() factory annotated # noqa: N802 with a TODO marker for removal when the extractor/archive migrate to direct construction.

Category 4 — Issue-number rot + lying docs (6399f37)

  • Stripped Phase A, Phase B, (#33), pre-#33, pre-Phase-A markers from new/modified code, tests, and docstrings.
  • Stripped (#17), (#19), (#20), (#24) references added in this PR's diff (preserved pre-existing ones in untouched code per scope).
  • Corrected _TRANSIENT_MEDIA_FAILURES cross-reference (was _TRANSIENT_FAILURES in fetch.py — symbol name was right but the rot marker (#19) removed per the no-rot rule).
  • Reworded Failed(transient) "terminal-ish" phrasing in ARCHITECTURE.md ### media. Added a Storage layout subsection.
  • Reconciled snapshot/media docs: confirmed snapshot.py covers only the 4 JSON artifacts; updated config.py:media_dir docstring and ARCHITECTURE.md to acknowledge the carve-out explicitly. Re-downloading is the recovery path.
  • Trimmed ARCHITECTURE.md invariant docs: README — sharper why + layered examples #10 to match WS2 Fase 1: enrichment core pipeline #1-Promote to main: WS2 Fase 1+2, quality gate, vocab worksheet, Safari auth, README #9 brevity; retry contract + storage layout moved to the ### media section.
  • Trimmed README.md media row in Commands table to a single sentence + section link. Marked disk-budget numbers as approximate.

Category 5 — Tests (c8e3f45)

  • Split test_download_all_records_http_4xx_failure_when_cascade_exhausted into two: bucket assertion via partial-success setup; total-failure raise via lone-attempt setup.
  • Added test_download_all_falls_back_through_full_cascade_to_medium (orig+large 404 → medium 200).
  • Added WebP and JPEG extension tests.
  • Added 3xx (304) handling test → unknown_error (transient).
  • Added atomic-write rollback test (monkeypatch Path.replace, partial-success co-item to avoid the total-failure raise).
  • Added _sweep_part_orphans smoke test.
  • Added partial-failure SUMMARY line test (0 downloads + N failures still emits).
  • Added Pillow truncated-PNG test (_png_bytes()[:32]format_error).
  • Added test_diff_media_reports_delta_failed.
  • Added test_media_command_resume_after_interrupt_completes_remaining — the full Ctrl-C-and-resume integration round.
  • Added test_media_command_warns_when_items_filter_matches_nothing (AVISO check).
  • Added test_media_command_verbose_lists_failed_urls.
  • Tightened test_media_command_runs_on_empty_store (exit-code 0 + items.json byte-identical).
  • Replaced brittle "URL no encontrada" in body or "HTTP 4xx" in body with the full rendered phrase "URL no encontrada (HTTP 4xx)".

Category 6 — Operational polish

  • --verbose flag now exists, prints failed URLs on completion. MediaReport.per_item_failures is kept (tests rely on it) and wired to the flag.
  • AVISO: stderr warning on --items mismatch mirrors _run_extract.

Final state

  • uv run pytest: 425 passed (was 412)
  • uv run ruff check: clean
  • uv run ruff format --check: clean (excluding pre-existing scripts/import_chrome_session.py, out of scope)
  • uv run mypy --ignore-missing-imports src/: clean
  • uv run poe check: ALL CRITICAL CHECKS PASSED
  • media.py coverage: 96% (was 93%)

VGonPa added 3 commits May 24, 2026 14:47
The PR description carries the issue link; code shouldn't reference PR
numbers in docstrings. Replace '(mirror of #24)' / '(mirrors #24)' with
behavioural phrasing so the tests still self-document but don't go stale
when the PR is closed.
The placate-vulture stub (`_ = cls`) was the first statement of the
function body, so Python parsed the subsequent triple-quoted block as a
no-op expression rather than as the function's docstring. Result:
inspect.getdoc(MediaPhotoDownloaded._reject_path_traversal) returned
None, hiding the rationale for the validator from tooling and IDEs.

Move the docstring to the first-statement position (where Python looks
for it) and put the vulture stub immediately after.
The new MediaPhotoDownloaded / MediaPhotoFailed validators (path
traversal, gt=0 dimensions, ge=1 attempts) were only exercised by the
happy paths. Add explicit rejection tests so a future refactor that
weakens a constraint trips a red test, not a silent regression at the
data boundary.

Six new tests, one per rejection path:
- absolute local_path
- parent-traversal local_path
- bytes_size=0
- width=0
- height=0
- attempts=0 on MediaPhotoFailed
@VGonPa
Copy link
Copy Markdown
Owner Author

VGonPa commented May 24, 2026

Verify-round flags resolved (3 commits, fix-round-2):

0165076 — strip stray issue-number references ((mirror of #24) / (mirrors #24)) from tests/test_media.py + tests/test_cli.py docstrings. Replaced with behavioural phrasing.

e1ec720 — fix path-traversal validator docstring positioning in src/xbrain/models.py. The _ = cls stub was the first statement of the function body, so Python parsed the triple-quoted block as an expression statement, not the docstring. inspect.getdoc(MediaPhotoDownloaded._reject_path_traversal) returned None before; now returns the docstring text. Moved the stub to immediately after the docstring.

a892da5 — add 6 validator-rejection tests in tests/test_models.py:

  • test_media_photo_downloaded_rejects_absolute_local_path
  • test_media_photo_downloaded_rejects_parent_traversal_local_path
  • test_media_photo_downloaded_rejects_zero_bytes_size
  • test_media_photo_downloaded_rejects_zero_width
  • test_media_photo_downloaded_rejects_zero_height
  • test_media_photo_failed_rejects_zero_attempts

Gates: 431 passed (was 425), ruff clean, mypy clean. Ready for re-verify.

@VGonPa VGonPa merged commit 2caf1dd into develop May 24, 2026
1 check passed
@VGonPa VGonPa deleted the ws-33-images-phase-a branch May 24, 2026 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant