[#20] refactor: ContentSource/FetchResult as tagged unions by VGonPa · Pull Request #29 · VGonPa/xbrain

VGonPa · 2026-05-22T17:04:42Z

Summary

Refactor ContentSource and FetchResult from "single class with optional fields" into pydantic discriminated unions, so success and failure are mutually exclusive variants enforced by both the runtime (pydantic) and the type system (mypy with the pydantic.mypy plugin).

ContentSource = Annotated[Union[ContentSourceSuccess, ContentSourceFailure], BeforeValidator(...), Field(discriminator="outcome")]
FetchResult = FetchSuccess | FetchFailure (in-memory only, not persisted)
Wire-compatible discriminator: outcome: "success" | "failure"
Backward-compatible deserialisation: legacy ok: bool records in existing data/items.json are normalised on read by a BeforeValidator on the union — no manual migration command required.
Every consumer (fetch.py, fetch_x.py, extract/threads.py, generate.py, worksheet.py, executors/api.py, notes_io.py) now switches on the variant via isinstance instead of reading optional sentinel fields.

Closes #20.

Decisions on the four open questions (from the PRD)

Discriminator field name: outcome ("success" / "failure"). kind was taken by ContentSource.kind (external_article / x_article / …); ok as a literal was ambiguous next to its legacy boolean meaning. outcome is self-documenting in the JSON.
JSON wire compat for legacy records: A BeforeValidator wraps the discriminated union. It maps ok=True → outcome="success" (and drops failure-only fields), ok=False → outcome="failure" (and drops success-only fields). Legacy records with ok=False, failure_reason=None (one such record in Víctor's data/items.json — an HTTP 429) bucket to failure_reason="timeout", which preserves the fetch: retry transiently-failed items without --force #19 "auto-retry uncategorised failures" behaviour without an extra branch in _should_refetch. Round-trip verified against the real 1884-item corpus: 274 success + 43 failure sources, all migrated.
mypy enforcement: Enabled pydantic.mypy plugin (with init_typed = true, init_forbid_extra = true). A tests/type_probes/illegal_states.py probe contains four annotations that MUST be mypy errors; tests/test_type_safety.py shells out to mypy and asserts all four fire. This pins the "illegal states unrepresentable" invariant against future edits.
Content (wrapper): Not a discriminated union. It carries sources: list[ContentSource]; the inner type splits, the outer wrapper does not need to.

Quality gate

Tests: 346 pass (up from 288 pre-fetch: retry transiently-failed items without --force #19, plus the post-rebase fetch: retry transiently-failed items without --force #19 truth-table tests rewritten to use the new variants).
Coverage: 89% (above 78% threshold).
Mypy: clean on src/xbrain (28 source files).
Ruff, ruff-format, bandit, detect-secrets, interrogate: PASS.
Vulture, radon: PASS / WARN (no new C grades; the existing one carries over).
Deptry: WARN-only (pre-existing browser_cookie3 note).

Interaction with #19

origin/develop already carries #19's _should_refetch helper (merged before this branch was rebased). The helper was adapted to the new variant shape:

return all(
    isinstance(src, ContentSourceFailure)
    and src.failure_reason in _TRANSIENT_FAILURES
    for src in external
)

Behaviour is preserved bit-for-bit, including the "uncategorised failures get one auto-retry" semantic — that case is now satisfied at the data layer (legacy records migrate None → "timeout") rather than via a special branch in _should_refetch. All eight _should_refetch truth-table tests and the four fetch_pending integration tests pass after the rewrite.

Test plan

uv run poe check — all 10 critical checks green
Legacy items.json round-trip on real data (1884 items, 317 sources) — round-trip equality preserved
mypy probe rejects the four illegal annotations (verified by tests/test_type_safety.py)
All 346 tests pass, including the 13 new model / store / type-safety tests and the rebased fetch: retry transiently-failed items without --force #19 truth-table tests
Manual: run xbrain generate after upgrade against a real data/items.json, diff the wiki — byte-identical output expected

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

Split ContentSource into ContentSourceSuccess and ContentSourceFailure with an `outcome` discriminator. Same for the internal FetchResult, now FetchSuccess | FetchFailure. The type system enforces that success-only fields (text, title) and failure-only fields (failure_reason, error) cannot coexist — illegal states are unrepresentable. Backward-compatible deserialisation: legacy items.json records (pre-#20, with `ok: bool`) are normalised on read by a BeforeValidator on the union. Uncategorised legacy failures (failure_reason=None) bucket to "timeout" so #19's auto-retry gives them one chance to land on a proper category. mypy now catches missing required fields and reading the wrong field off a narrowed variant — verified by a probe at tests/type_probes/illegal_states.py and a regression test that runs mypy on it. Quality gate: 288 tests, 88% coverage, all critical checks pass.

Addresses the HIGH findings from the silent-failure-hunter on PR #29 (silent regression of #19's invariant). The other 5 reviewers approved or flagged the same area as "doc-only"; this commit closes the real behavioural gap. The bug: `_content_source_from` and the bare-except in `fetch_item` bucketed an uncategorised failure to `"empty_content"` (a TERMINAL failure_reason). #19 guaranteed that a failure without a categorised reason was retry-worthy ("anomalous failures get one auto-retry"); post-#20 those records became permanently stuck without `--force`. A silent regression: tests pass, the wiki renders, but real-world uncaught extractor exceptions self-heal no more. Fix: new `FailureReason = "unknown_error"` Literal value, added to `_TRANSIENT_FAILURES`. The two collapse sites and the legacy migrator all bucket uncategorised failures here. Changes: - `src/xbrain/models.py`: `FailureReason` gains `"unknown_error"`. `_normalise_legacy_content_source` now buckets legacy `ok=False, failure_reason=None` records to `"unknown_error"` (was `"timeout"` — misleading because the actual cause was unknown, not a timeout) and emits a `logger.warning` per migrated record (silent-failure HIGH 2: the legacy bucketing was invisible). The "missing both discriminators" ValueError now includes the offending URL (silent-failure HIGH 1 partial fix: better error context; per-item resilience is a separate larger change deferred to a follow-up issue). - `src/xbrain/fetch.py`: `_TRANSIENT_FAILURES` gains `"unknown_error"`; the bare-except in `fetch_item` and the `_content_source_from` fallback both target `"unknown_error"` instead of `"empty_content"`; `FetchFailure` docstring updated. - `src/xbrain/generate.py`: `_FAILURE_ES` gains the Spanish translation `"error desconocido"`. - `tests/test_models.py`, `tests/test_fetch.py`: legacy-migration test updated to expect `unknown_error` instead of `timeout`. TWO new regression tests pin the new invariant: - `test_extractor_exception_persists_as_transient_failure`: bare-except persists `unknown_error` AND `_should_refetch` returns True. - `test_content_source_from_uncategorised_failure_is_transient`: the `FetchFailure → ContentSourceFailure` mapping uses `unknown_error`. Skipped (low-priority, follow-up): - Per-item resilience in load_store (HIGH 1 full fix): structural change, separate issue if real. - M4 (generate.py silent elision of thread/quoted_tweet failures): pre-existing pattern, out of scope. - python-code-reviewer M1/M2/M3: doc-tier polish. - simplifier suggestions (drop pop() defensive, hoist local import in notes_io): doc/style polish. - test-analyzer polish (3 model_copy / JSON round-trip assertions): improvements, not blockers. Total: 348 tests (up from 346), coverage 89%, quality gate all-green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

VGonPa and others added 3 commits May 22, 2026 19:03

[#20] style: ruff format post-rebase

fac3739

VGonPa merged commit 6d555eb into develop May 22, 2026
1 check passed

This was referenced May 22, 2026

Refactor: FetchResult / ContentSource as tagged unions #20

Closed

Wave: #19 transient retry + #13 hashtags + #18 xbrain diff + #20 tagged unions #30

Merged

[#24] Narrow bare except in ApiExecutor + synthesize_overviews_api; doc xbrain diff #31

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[#20] refactor: ContentSource/FetchResult as tagged unions#29

[#20] refactor: ContentSource/FetchResult as tagged unions#29
VGonPa merged 3 commits into
developfrom
ws-20-tagged-unions

VGonPa commented May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

VGonPa commented May 22, 2026

Summary

Decisions on the four open questions (from the PRD)

Quality gate

Interaction with #19

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant