34 changes: 32 additions & 2 deletions ARCHITECTURE.md
@@ -266,6 +266,35 @@ The numbered stages above are summarised; the sections below cover each one in d

**Snapshot trigger.** `xbrain media` always snapshots `data/` first (label `pre-media`), mirroring the destructive-op recovery boundary. The snapshot covers `items.json` / `state.json` / `vocab.yaml` / `topics.json` only — the binary photo bytes under `data/media/` are NOT included; re-downloading via `xbrain media` is the recovery path.

### describe

**What it does.** Sends every downloaded photo to a Claude vision model, asks for a 1-3 sentence prose description plus an `is_decorative` classification, and persists the prose on the entry. The entry transitions from `MediaPhotoDownloaded` to `MediaPhotoDescribed` (a new variant on the `MediaEntry` union). Decorative photos (avatars, reaction memes, abstract backgrounds) are classified as such with an empty description so downstream prompts can filter them out without re-classifying.

**Reads.** `data/items.json` + `data/media/<id>/<n>.<ext>` (the bytes the downloader wrote).

**Writes.** `data/items.json` — each described photo entry carries `is_decorative` + `description` + `description_lang` + `description_version` + `described_at`. No new on-disk binary state; the bytes from the prior `MediaPhotoDownloaded` are inherited verbatim.

**State machine.** Each `xbrain describe` run advances eligible photo entries:
- `Downloaded` → `Described` (description on the entry, bytes unchanged).
- `Described` (stale version OR stale language) → `Described` (current version + current language), automatically.
- `Described` (current version + current language) → no-op (skipped) unless `--force`.

Eligibility ignores `Pending` / `Failed` / `VideoPending`: describe only runs on photos with bytes on disk. The description-version tag is the rubric-evolution lever: bumping `[describe].version` in `config.toml` invalidates persisted entries so the next run re-describes them without `--force`. The `description_lang` check is the mixed-vault guard: switching `[output].language` from Spanish → English (or back) marks every previously-described entry stale so the enrich prompt never splices the wrong-language prose into a new vault.
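
The eligibility rules above can be read as a single predicate. This is a minimal sketch, not the project's implementation: `PhotoEntry`, `needs_describe` and the lowercase `kind` tags are hypothetical, and only the field names `description_version` and `description_lang` come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PhotoEntry:
    """Hypothetical stand-in for a photo `MediaEntry`; not the real model."""

    kind: str  # assumed tags: "pending", "downloaded", "failed", "described"
    description_version: Optional[str] = None
    description_lang: Optional[str] = None


def needs_describe(entry: PhotoEntry, *, version: str, lang: str, force: bool = False) -> bool:
    """Return True when a describe run should (re-)describe this entry."""
    if entry.kind == "downloaded":
        return True  # bytes on disk, never described
    if entry.kind == "described":
        stale = entry.description_version != version or entry.description_lang != lang
        return stale or force  # current entries are skipped unless forced
    return False  # no bytes on disk, so nothing to describe
```

Under these assumptions, bumping the version or switching the language flips the predicate for every described entry, which is why no `--force` is needed in either case.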

**Batching.** Default batch size is 5 images per API call (the spec's quality / cost sweet spot — ~12-15 % token saving vs per-image, modest added complexity). Override with `--batch-size N`.
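
As an illustration of the batching step, the sketch below splits a list of photos into groups of at most `batch_size`. The `batched` helper is a hypothetical name and may not match how the orchestrator actually chunks its work.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(photos: Sequence[T], batch_size: int = 5) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `batch_size` photos."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(photos), batch_size):
        yield photos[start:start + batch_size]


# 12 photos at the default size give batches of 5, 5 and 2 (three API calls).
sizes = [len(batch) for batch in batched(list(range(12)))]
```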

**Refusals.** Vision refusals (faces, NSFW) are NOT a hard failure: the entry is persisted as decorative with an empty description, and the run continues. The same `is_decorative` flag downstream consumers already use for "no topic signal" handles the refusal uniformly.

**Failure isolation.** Per-batch error isolation: one failing API call does not abort the run. A total-failure run (every batch errored) raises `RuntimeError` so the CLI surfaces non-zero exit. The orchestrator's `on_progress` callback writes `items.json` between batches so Ctrl-C mid-run leaves the store coherent — same recovery contract as `media`.
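
The isolation contract could be sketched roughly as follows. This is an assumption-laden outline, not the orchestrator's code: `run_batches` and `describe_batch` are hypothetical names, and only the `on_progress` callback and the `RuntimeError` on total failure come from the text above.

```python
from typing import Callable, Sequence, Tuple


def run_batches(
    batches: Sequence[Sequence[str]],
    describe_batch: Callable[[Sequence[str]], None],
    on_progress: Callable[[], None],
) -> Tuple[int, int]:
    """Run every batch, isolating failures; return (succeeded, failed)."""
    succeeded = failed = 0
    for batch in batches:
        try:
            describe_batch(batch)  # hypothetical: one vision API call
            succeeded += 1
        except Exception:
            failed += 1  # this batch is lost, the run continues
        on_progress()  # persist between batches
    if batches and succeeded == 0:
        raise RuntimeError("describe: every batch failed")
    return succeeded, failed
```

Catching `Exception` rather than `BaseException` lets `KeyboardInterrupt` propagate, so in this sketch a Ctrl-C still stops the run after the last persisted batch.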

**Snapshot trigger.** `xbrain describe` always snapshots `data/` first (label `pre-describe`), mirroring `media`'s recovery boundary. A botched run — wrong model, runaway prompt — can be undone with `xbrain snapshot restore`.

**Feeds the LLM stages.** Once described, the prose is consumed automatically:
- `xbrain enrich` (in `executors/api.py:_user_prompt`) splices an `Images in this post:` section between the post body and the links/article block when the item has content-bearing described photos. Decoratives are filtered.
- `xbrain topics` (in `topic_synth.py:_user_prompt`) appends the flat list of content-bearing image descriptions across every post in a topic, after the per-post summaries.

This is how a tweet that is mostly a screenshot of a paper becomes searchable by what the screenshot was actually about.
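
A rough sketch of how the `Images in this post:` section might be assembled, assuming a simple bulleted layout. The `Described` record and the `images_section` helper are hypothetical; the real `_user_prompt` functions may format the block differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Described:
    """Hypothetical described-photo record; field names follow the text above."""

    description: str
    is_decorative: bool


def images_section(photos: List[Described]) -> str:
    """Build the images block, or "" when no photo is content-bearing."""
    lines = [f"- {p.description}" for p in photos if not p.is_decorative and p.description]
    if not lines:
        return ""  # decorative-only items add nothing to the prompt
    return "Images in this post:\n" + "\n".join(lines)
```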

### vocab

**What it does.** Induces a closed taxonomy of ~30-45 topics from the whole corpus. Map step: chunks the corpus, asks an LLM to propose candidate topics per chunk. Reduce step: asks the LLM to consolidate the union of candidates down to `vocab.target_count` topics. Always includes a `misc` topic for posts with no thematic core.
@@ -333,7 +362,7 @@ Everything XBrain knows lives in four files inside `data/` (gitignored). They ar

| File | Format | What it is | Mutated by |
|------|--------|------------|------------|
| `items.json` | JSON array of `Item` | The source of truth — every post XBrain has ever seen, with all fetched content and enrichment | `extract`, `fetch`, `enrich`, `media` |
| `items.json` | JSON array of `Item` | The source of truth — every post XBrain has ever seen, with all fetched content, enrichment, and per-photo vision descriptions | `extract`, `fetch`, `enrich`, `media`, `describe` |
| `state.json` | JSON | Extractor cursors (`last_seen_id`, `last_run`) per source, archive-import marker | `extract`, `import-archive` |
| `vocab.yaml` | YAML list of `Topic` | The controlled topic taxonomy — closed list of slugs + descriptions | `vocab` |
| `topics.json` | JSON dict of `TopicPage` | The synthesized topic-page overviews and notes, keyed by slug | `topics` |
@@ -355,6 +384,7 @@ The LLM-driven stages (`vocab`, `enrich`, `topics`) do not have their instructio
| `rubric-topics.md` | `enrich` | Assign one `primary_topic` + 0-3 secondaries from the closed vocab. Never invent slugs |
| `rubric-summary.md` | `enrich` | Write a 1-3 sentence summary, faithful to the post and the fetched article, no hallucination |
| `rubric-topic-page.md` | `topics` | Synthesize 1-3 paragraphs of plain prose + up to 15 short notes per topic, zero wikilinks |
| `rubric-describe-image.md` | `describe` | Classify each photo as decorative vs content-bearing and describe content-bearing ones in 1-3 sentences. Refusals fall through as decorative with empty description |

**Why a separate file per rubric.** Changing how XBrain summarizes posts is editing one markdown file, not chasing a string through the codebase. The rubric is the *contract* between code and LLM; the code only handles structure, transport and validation.

@@ -438,7 +468,7 @@ These are the rules the rest of the architecture rests on. Breaking any of them
7. **Operation names, not query ids.** The extractor anchors to X GraphQL operation names because X rotates the ids. Anything that hardcodes an id will break.
8. **Destructive ops are reversible.** Every command that overwrites a `data/` artifact (`vocab --regenerate`, `topics --resynth`, `fetch --force`) snapshots `data/` first to `data/snapshots/<ts>-pre-<command>/`. `xbrain snapshot restore <name>` is the recovery path. A snapshot failure aborts the destructive op.
9. **Fetch records are tagged unions.** A `ContentSource` on `items.json` is either a `Success` (with required `text`) or a `Failure` (with required `failure_reason`). Mixed shapes are not representable — pydantic rejects them at construction, and mypy rejects them statically (via the `pydantic.mypy` plugin). Legacy records with `ok: bool` (pre-#20) are normalised on read by a `BeforeValidator` on the union, so existing `data/items.json` files keep working without a manual migration. The static contract is pinned by `tests/type_probes/illegal_states.py`.
10. **Media variants are mutually exclusive states.** A `MediaEntry` on `items.json` is one of `MediaPhotoPending` / `MediaPhotoDownloaded` / `MediaPhotoFailed` / `MediaVideoPending`, discriminated by `kind`. State transitions happen only via `xbrain media`. Legacy records with the flat `{type, url}` shape are normalised on read by a `BeforeValidator` on the union — no manual migration needed. (See the `### media` section above for the retry contract and storage layout.)
10. **Media variants are mutually exclusive states.** A `MediaEntry` on `items.json` is one of `MediaPhotoPending` / `MediaPhotoDownloaded` / `MediaPhotoFailed` / `MediaPhotoDescribed` / `MediaVideoPending`, discriminated by `kind`. The photo states form a linear pipeline: `Pending → Downloaded → Described` (with `Failed` as the off-ramp from `Pending`). State transitions happen only via `xbrain media` (advances `Pending` and retries `Failed`) and `xbrain describe` (advances `Downloaded` to `Described`). Legacy records with the flat `{type, url}` shape are normalised on read by a `BeforeValidator` on the union — no manual migration needed. (See the `### media` and `### describe` sections above for the per-stage contracts.)
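
The photo pipeline in rule 10 can be pictured as a small transition table keyed by command. This is only an illustrative sketch: the lowercase state tags and the `can_transition` helper are assumptions, and the real model enforces its states through the `MediaEntry` union rather than a lookup table.

```python
# Assumed lowercase state tags; the rule above only names the model classes.
ALLOWED = {
    "media": {
        "pending": {"downloaded", "failed"},
        "failed": {"downloaded", "failed"},  # a retry may fail again
    },
    "describe": {
        "downloaded": {"described"},
        "described": {"described"},  # stale version or language, or --force
    },
}


def can_transition(command: str, src: str, dst: str) -> bool:
    """True when `command` may move a photo entry from `src` to `dst`."""
    return dst in ALLOWED.get(command, {}).get(src, set())
```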

---

29 changes: 28 additions & 1 deletion README.md
@@ -397,6 +397,8 @@ topic_style = "wikilink" # wikilink | hashtag (in-body Topics:
| `[topics]` | `resynth_threshold` | `25` | Post growth that marks a topic overview stale. |
| `[output]` | `language` | `English` | Output language for LLM summaries/overviews AND wiki section headers. `English` or `Spanish`. |
| `[output]` | `topic_style` | `wikilink` | How the in-body `**Topics:**` line is rendered: `wikilink` (`[[slug]] · [[slug]]`) or `hashtag` (`#slug #slug`). Frontmatter `tags:` are unaffected. |
| `[describe]` | `model` | `claude-sonnet-4-6` | Vision model for `xbrain describe`. Override per run with `--model`. |
| `[describe]` | `version` | `v1` | Tag persisted on every described photo. Bumping invalidates existing descriptions so the next `xbrain describe` re-describes stale entries. |

Switching `[output].language` after the corpus is already enriched is supported
— but does not retroactively translate existing summaries. To convert the
@@ -517,6 +519,7 @@ uv run xbrain <command> [options]
| `import-archive <zip>` | Backfill the full own-tweet history from the official X data archive. |
| `fetch` | Download linked article content, expand threads, fetch linked X content. By default, items whose only previous failures were transient (`timeout`, `dns_error`) are re-fetched automatically; terminal failures (`not_found`, `paywall`, `forbidden`, `js_required`, `empty_content`) stay skipped until `--force`. `--force` re-fetches every external_article source regardless of state. |
| `media` | Download X-post photos referenced in `Item.media` and render them inline in the wiki. `--force`, `--limit N`, `--items <a,b,c>`, `--verbose`. See [Local media storage](#local-media-storage). |
| `describe` | Describe downloaded photos with a vision LLM (Claude Sonnet 4.6 by default) and feed the prose into `enrich` + `topics`. `--force`, `--limit N`, `--items <a,b,c>`, `--model`, `--batch-size`, `--verbose`. Idempotent: re-runs skip photos already described at the current version and output language unless `--force` is passed; bumping `[describe].version` in `config.toml` re-describes stale entries automatically. |
| `vocab` | Induce the topic taxonomy. `--executor`, `--apply <file>`, `--regenerate`. |
| `enrich` | Enrich items with a summary + topics. `--executor`, `--apply <file>`. |
| `topics` | Synthesise topic pages. `--executor`, `--apply <file>`, `--resynth`. |
@@ -583,7 +586,31 @@ Failures are categorised on the item itself
the next `xbrain media` run.

Run `xbrain diff <snapshot>` after a media run to see how many photos
moved from `pending` / `failed` into `downloaded`.
moved from `pending` / `failed` into `downloaded` (or, after `xbrain
describe`, into `described`).

**Vision descriptions**

Once `xbrain media` has the bytes on disk, `xbrain describe` runs every
photo through Claude vision and stores a short prose description on
the entry (transitioning `MediaPhotoDownloaded` → `MediaPhotoDescribed`).
Descriptions are 1-3 sentences, faithful, in the configured
`output_language`. Decorative photos (avatars, reaction memes,
abstract backgrounds) are classified as such and persisted with an
empty description so they introduce no topic noise downstream.

`xbrain enrich` and `xbrain topics` consume the descriptions
automatically: an item with content-bearing photos gets an
`Images in this post:` block in the enrichment prompt; topic-page
synthesis sees the flat list of content image descriptions across the
topic's posts. This is how a tweet that is mostly a screenshot of a
paper becomes searchable by what the screenshot was actually about.

Describing a corpus of about 2k images costs roughly $3-5 with the default
model (Sonnet 4.6, 5 images per call). Bump `[describe].version` in
`config.toml` to invalidate stored descriptions when you change the
rubric; the next `xbrain describe` run then re-describes stale entries
automatically, without `--force`.

---

11 changes: 11 additions & 0 deletions config.toml.example
@@ -39,3 +39,14 @@ language = "English"
# "wikilink" - **Topics:** [[ai-coding]] · [[software-engineering]] (default)
# "hashtag" - **Topics:** #ai-coding #software-engineering
topic_style = "wikilink"

[describe]
# Vision model used by `xbrain describe`. Sonnet 4.6 is the default —
# the quality / cost sweet spot (~$3-5 for a 2k-image corpus). Override
# per run with `--model` while iterating; the CLI flag wins.
model = "claude-sonnet-4-6"
# Description-version tag persisted on every described photo. Bumping
# this value invalidates existing descriptions: the next `xbrain
# describe` run re-describes stale entries automatically. Use it when
# you change the describe-image rubric or expectations.
version = "v1"
128 changes: 128 additions & 0 deletions src/xbrain/cli.py
@@ -14,6 +14,8 @@
from xbrain import snapshot
from xbrain.archive import parse_archive
from xbrain.config import Config, load_config
from xbrain.describe import describe_all as run_describe_all
from xbrain.describe import emit_summary_line as describe_emit_summary_line
from xbrain.diff import diff_snapshots, format_json, format_text
from xbrain.enrich import apply_worksheet_judgments, enrich_with_executor, items_pending_enrichment
from xbrain.executors.api import ApiExecutor
@@ -359,6 +361,132 @@ def media(
    _run_media(cfg, force=force, limit=limit, items_filter=items_filter, verbose=verbose)


def _run_describe(
    cfg: Config,
    *,
    force: bool,
    limit: int | None,
    items_filter: list[str] | None,
    model: str,
    batch_size: int,
    verbose: bool,
) -> None:
    """Run the vision-describe orchestrator and persist after every batch.

    Always snapshots `data/` first (the same recovery boundary as
    `xbrain media`): a botched run (a wrong model, a runaway prompt)
    can be undone with `xbrain snapshot restore`. Coherence on a
    Ctrl-C mid-run is held by the outer `try/finally` below, which
    saves the store unconditionally even when the orchestrator raises;
    the `on_progress` callback is for incremental persistence between
    batches on a clean run (so a long describe run never loses more
    than one batch of work to a process death).
    """
    if items_filter:
        target = set(items_filter)
        store_ids = set(load_store(cfg.items_path))
        missing = target - store_ids
        if missing and not (target & store_ids):
            typer.echo(
                f"WARNING: --items {','.join(items_filter)} does not match any item "
                f"in the store ({len(store_ids)} items). The run will be a no-op.",
                err=True,
            )
    _auto_snapshot(cfg, "describe")
    store = load_store(cfg.items_path)

    def _persist() -> None:
        save_store(store, cfg.items_path)

    try:
        report = run_describe_all(
            store,
            cfg.media_dir,
            model=model,
            output_language=cfg.output_language,
            description_version=cfg.describe_version,
            force=force,
            limit=limit,
            items_filter=items_filter,
            batch_size=batch_size,
            on_progress=_persist,
        )
    finally:
        # Persist whatever transitioned, even if `describe_all` raised. A
        # RuntimeError on total failure must not discard the per-photo
        # MediaPhotoDescribed records that landed before the raise.
        save_store(store, cfg.items_path)
    describe_emit_summary_line(report)
    typer.echo(
        f"Describe: described {report.photos_described}, "
        f"failed {report.photos_failed}, "
        f"skipped {report.photos_skipped_already_described}"
    )
    if verbose and report.per_item_failures:
        typer.echo("Failed photos:", err=True)
        for item_id, failures in sorted(report.per_item_failures.items()):
            for url, error in failures:
                typer.echo(f" {item_id} {url} {error}", err=True)


@app.command()
@_handle_cli_errors
def describe(
    force: bool = typer.Option(
        False,
        "--force",
        help="Re-describe every photo, including those already described at the current version.",
    ),
    limit: int | None = typer.Option(
        None,
        "--limit",
        help="Maximum number of photos to describe in this run.",
    ),
    items: str | None = typer.Option(
        None,
        "--items",
        help="Comma-separated item IDs that limit the scope of the run.",
    ),
    model: str | None = typer.Option(
        None,
        "--model",
        help="Vision model to use. If omitted, the config value (`describe.model`) is used.",
    ),
    batch_size: int = typer.Option(
        5,
        "--batch-size",
        min=1,
        help="Number of images per API call. 5 is the sweet spot (12-15%% token saving).",
    ),
    verbose: bool = typer.Option(
        False,
        "--verbose",
        help="Print every failed photo (item_id, URL, error) at the end of the run.",
    ),
) -> None:
    """Describe downloaded photos with a vision LLM.

    Only describes photos with bytes on disk (`MediaPhotoDownloaded`).
    Entries already described at the current version are skipped; bumping
    `[describe].version` in `config.toml` forces an automatic
    re-describe without `--force`. Descriptions are persisted in
    `items.json` and consumed by `xbrain enrich` and `xbrain topics`
    in subsequent LLM calls.
    """
    cfg = _config()
    items_filter = [s.strip() for s in items.split(",") if s.strip()] if items else None
    chosen_model = model or cfg.describe_model
    _run_describe(
        cfg,
        force=force,
        limit=limit,
        items_filter=items_filter,
        model=chosen_model,
        batch_size=batch_size,
        verbose=verbose,
    )


@app.command()
@_handle_cli_errors
def enrich(
14 changes: 14 additions & 0 deletions src/xbrain/config.py
@@ -29,6 +29,17 @@ class Config:
    topics_resynth_threshold: int
    output_language: str # one of xbrain.i18n.SUPPORTED_LANGUAGES
    topic_style: str # one of xbrain.config.SUPPORTED_TOPIC_STYLES
    # `describe_model` defaults to Sonnet 4.6; the spec settled on it as the
    # quality / cost sweet spot for vision (~$3-5 for a 2k-image corpus).
    # Override per run via `xbrain describe --model ...` when iterating on
    # prompt or budget; the CLI flag wins over the config value.
    describe_model: str
    # `describe_version` tags every produced description so a prompt
    # evolution can be rolled out incrementally: bumping the value here
    # makes the next `xbrain describe` run re-describe stale entries
    # automatically (no `--force` needed). The string is exact-match:
    # there is no ordering relation, only equality.
    describe_version: str

    @property
    def items_path(self) -> Path:
@@ -95,6 +106,7 @@ def load_config(repo_root: Path) -> Config:
            f"config.toml: [output].topic_style must be one of "
            f"{list(SUPPORTED_TOPIC_STYLES)}, got {topic_style!r}"
        )
    describe = settings.get("describe", {})
    return Config(
        repo_root=repo_root,
        vault=vault,
@@ -107,4 +119,6 @@ def load_config(repo_root: Path) -> Config:
        topics_resynth_threshold=resynth_threshold,
        output_language=output_language,
        topic_style=topic_style,
        describe_model=describe.get("model", "claude-sonnet-4-6"),
        describe_version=describe.get("version", "v1"),
    )