Skip to content

[#20] refactor: ContentSource/FetchResult as tagged unions#29

Merged
VGonPa merged 3 commits into
developfrom
ws-20-tagged-unions
May 22, 2026
Merged

[#20] refactor: ContentSource/FetchResult as tagged unions#29
VGonPa merged 3 commits into
developfrom
ws-20-tagged-unions

Conversation

@VGonPa
Copy link
Copy Markdown
Owner

@VGonPa VGonPa commented May 22, 2026

Summary

Refactor ContentSource and FetchResult from "single class with optional fields" into pydantic discriminated unions, so success and failure are mutually exclusive variants enforced by both the runtime (pydantic) and the type system (mypy with the pydantic.mypy plugin).

  • ContentSource = Annotated[Union[ContentSourceSuccess, ContentSourceFailure], BeforeValidator(...), Field(discriminator="outcome")]
  • FetchResult = FetchSuccess | FetchFailure (in-memory only, not persisted)
  • Wire-compatible discriminator: outcome: "success" | "failure"
  • Backward-compatible deserialisation: legacy ok: bool records in existing data/items.json are normalised on read by a BeforeValidator on the union — no manual migration command required.
  • Every consumer (fetch.py, fetch_x.py, extract/threads.py, generate.py, worksheet.py, executors/api.py, notes_io.py) now switches on the variant via isinstance instead of reading optional sentinel fields.

Closes #20.

Decisions on the four open questions (from the PRD)

  1. Discriminator field name: outcome ("success" / "failure"). kind was taken by ContentSource.kind (external_article / x_article / …); ok as a literal was ambiguous next to its legacy boolean meaning. outcome is self-documenting in the JSON.
  2. JSON wire compat for legacy records: A BeforeValidator wraps the discriminated union. It maps ok=Trueoutcome="success" (and drops failure-only fields), ok=Falseoutcome="failure" (and drops success-only fields). Legacy records with ok=False, failure_reason=None (one such record in Víctor's data/items.json — an HTTP 429) bucket to failure_reason="timeout", which preserves the fetch: retry transiently-failed items without --force #19 "auto-retry uncategorised failures" behaviour without an extra branch in _should_refetch. Round-trip verified against the real 1884-item corpus: 274 success + 43 failure sources, all migrated.
  3. mypy enforcement: Enabled pydantic.mypy plugin (with init_typed = true, init_forbid_extra = true). A tests/type_probes/illegal_states.py probe contains four annotations that MUST be mypy errors; tests/test_type_safety.py shells out to mypy and asserts all four fire. This pins the "illegal states unrepresentable" invariant against future edits.
  4. Content (wrapper): Not a discriminated union. It carries sources: list[ContentSource]; the inner type splits, the outer wrapper does not need to.

Quality gate

Interaction with #19

origin/develop already carries #19's _should_refetch helper (merged before this branch was rebased). The helper was adapted to the new variant shape:

return all(
    isinstance(src, ContentSourceFailure)
    and src.failure_reason in _TRANSIENT_FAILURES
    for src in external
)

Behaviour is preserved bit-for-bit, including the "uncategorised failures get one auto-retry" semantic — that case is now satisfied at the data layer (legacy records migrate None"timeout") rather than via a special branch in _should_refetch. All eight _should_refetch truth-table tests and the four fetch_pending integration tests pass after the rewrite.

Test plan

  • uv run poe check — all 10 critical checks green
  • Legacy items.json round-trip on real data (1884 items, 317 sources) — round-trip equality preserved
  • mypy probe rejects the four illegal annotations (verified by tests/test_type_safety.py)
  • All 346 tests pass, including the 13 new model / store / type-safety tests and the rebased fetch: retry transiently-failed items without --force #19 truth-table tests
  • Manual: run xbrain generate after upgrade against a real data/items.json, diff the wiki — byte-identical output expected

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

VGonPa and others added 3 commits May 22, 2026 19:03
Split ContentSource into ContentSourceSuccess and ContentSourceFailure
with an `outcome` discriminator. Same for the internal FetchResult,
now FetchSuccess | FetchFailure. The type system enforces that
success-only fields (text, title) and failure-only fields
(failure_reason, error) cannot coexist — illegal states are
unrepresentable.

Backward-compatible deserialisation: legacy items.json records
(pre-#20, with `ok: bool`) are normalised on read by a BeforeValidator
on the union. Uncategorised legacy failures (failure_reason=None)
bucket to "timeout" so #19's auto-retry gives them one chance to
land on a proper category.

mypy now catches missing required fields and reading the wrong
field off a narrowed variant — verified by a probe at
tests/type_probes/illegal_states.py and a regression test
that runs mypy on it.

Quality gate: 288 tests, 88% coverage, all critical checks pass.
Addresses the HIGH findings from the silent-failure-hunter on PR #29
(silent regression of #19's invariant). The other 5 reviewers approved
or flagged the same area as "doc-only"; this commit closes the real
behavioural gap.

The bug: `_content_source_from` and the bare-except in `fetch_item`
bucketed an uncategorised failure to `"empty_content"` (a TERMINAL
failure_reason). #19 guaranteed that a failure without a categorised
reason was retry-worthy ("anomalous failures get one auto-retry");
post-#20 those records became permanently stuck without `--force`. A
silent regression: tests pass, the wiki renders, but real-world
uncaught extractor exceptions self-heal no more.

Fix: new `FailureReason = "unknown_error"` Literal value, added to
`_TRANSIENT_FAILURES`. The two collapse sites and the legacy migrator
all bucket uncategorised failures here.

Changes:
- `src/xbrain/models.py`: `FailureReason` gains `"unknown_error"`.
  `_normalise_legacy_content_source` now buckets legacy `ok=False,
  failure_reason=None` records to `"unknown_error"` (was `"timeout"` —
  misleading because the actual cause was unknown, not a timeout) and
  emits a `logger.warning` per migrated record (silent-failure HIGH 2:
  the legacy bucketing was invisible). The "missing both discriminators"
  ValueError now includes the offending URL (silent-failure HIGH 1
  partial fix: better error context; per-item resilience is a separate
  larger change deferred to a follow-up issue).
- `src/xbrain/fetch.py`: `_TRANSIENT_FAILURES` gains `"unknown_error"`;
  the bare-except in `fetch_item` and the `_content_source_from`
  fallback both target `"unknown_error"` instead of `"empty_content"`;
  `FetchFailure` docstring updated.
- `src/xbrain/generate.py`: `_FAILURE_ES` gains the Spanish translation
  `"error desconocido"`.
- `tests/test_models.py`, `tests/test_fetch.py`: legacy-migration test
  updated to expect `unknown_error` instead of `timeout`. TWO new
  regression tests pin the new invariant:
  - `test_extractor_exception_persists_as_transient_failure`: bare-except
    persists `unknown_error` AND `_should_refetch` returns True.
  - `test_content_source_from_uncategorised_failure_is_transient`: the
    `FetchFailure → ContentSourceFailure` mapping uses `unknown_error`.

Skipped (low-priority, follow-up):
- Per-item resilience in load_store (HIGH 1 full fix): structural
  change, separate issue if real.
- M4 (generate.py silent elision of thread/quoted_tweet failures):
  pre-existing pattern, out of scope.
- python-code-reviewer M1/M2/M3: doc-tier polish.
- simplifier suggestions (drop pop() defensive, hoist local import in
  notes_io): doc/style polish.
- test-analyzer polish (3 model_copy / JSON round-trip assertions):
  improvements, not blockers.

Total: 348 tests (up from 346), coverage 89%, quality gate all-green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant