Merged
3 changes: 3 additions & 0 deletions ARCHITECTURE.md
@@ -148,6 +148,7 @@ optional Firecrawl fallback, Playwright for x.com).
- **Reads:** `items.json`
- **Writes:** `items.json` — each item's `content` + `content_source[]`
- **Cached** — already-fetched items are skipped (use `--force` to refetch).
- **Transient retries** — items whose only previous failures were `timeout` / `dns_error` are re-fetched on the next run without `--force`. Terminal failures (`not_found`, `paywall`, `forbidden`, `js_required`, `empty_content`) stay skipped until `--force`.
- **Failures recorded as evidence** — `http_status` + `failure_reason`, never silently dropped.
- **Snapshots `data/` before `--force`** — recovery path if a forced refetch makes things worse.
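
The cache and retry rules in the bullets above can be sketched as a single decision function. This is only an illustration: `should_refetch` and its parameters are hypothetical names, not the project's API, and treating an unknown failure reason as terminal is an assumption. Only the two sets of failure reasons come from the text.

```python
# Failure reasons as listed in the bullets above.
TRANSIENT = frozenset({"timeout", "dns_error"})
TERMINAL = frozenset(
    {"not_found", "paywall", "forbidden", "js_required", "empty_content"}
)


def should_refetch(
    failure_reasons: list[str], has_content: bool, force: bool = False
) -> bool:
    """Decide whether an item is fetched again on this run (hypothetical helper)."""
    if force:
        return True  # --force refetches regardless of state
    if has_content:
        return False  # cached: already fetched successfully
    if not failure_reasons:
        return True  # never attempted
    # Retry only when every recorded failure is transient. Assumption: a
    # reason outside both sets is treated like a terminal failure.
    return all(reason in TRANSIENT for reason in failure_reasons)


print(should_refetch(["timeout", "dns_error"], has_content=False))  # True
print(should_refetch(["timeout", "paywall"], has_content=False))    # False
```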

@@ -394,6 +395,7 @@ These are the rules the rest of the architecture rests on. Breaking any of them
6. **`fetch` is cached per item id.** Re-runs do not re-hit the network without `--force` (or, in the future, transient-retry — issue #19).
7. **Operation names, not query ids.** The extractor anchors to X GraphQL operation names because X rotates the ids. Anything that hardcodes an id will break.
8. **Destructive ops are reversible.** Every command that overwrites a `data/` artifact (`vocab --regenerate`, `topics --resynth`, `fetch --force`) snapshots `data/` first to `data/snapshots/<ts>-pre-<command>/`. `xbrain snapshot restore <name>` is the recovery path. A snapshot failure aborts the destructive op.
9. **Fetch records are tagged unions.** A `ContentSource` on `items.json` is either a `Success` (with required `text`) or a `Failure` (with required `failure_reason`). Mixed shapes are not representable — pydantic rejects them at construction, and mypy rejects them statically (via the `pydantic.mypy` plugin). Legacy records with `ok: bool` (pre-#20) are normalised on read by a `BeforeValidator` on the union, so existing `data/items.json` files keep working without a manual migration. The static contract is pinned by `tests/type_probes/illegal_states.py`.

---

@@ -432,6 +434,7 @@ xbrain/
│ ├── notes_io.py ← per-note read/write + user-tail preservation
│ ├── store.py ← items.json / topics.json / state.json I/O
│ ├── snapshot.py ← data/ snapshot lifecycle (create/list/restore/prune)
│ ├── diff.py ← structured diff between two snapshot data dirs
│ ├── worksheet.py ← enrich worksheet export/import
│ ├── validate.py ← guardrails enforcement
│ ├── llm_json.py ← extract JSON from LLM responses
23 changes: 22 additions & 1 deletion README.md
@@ -195,6 +195,11 @@ Code Is Cheap Now. Software Isn't. https://t.co/J9m5RzQNbW
Everything above the `xbrain:generated` marker is regenerated on every run;
anything *you* write below it is preserved.

> Set `[output] topic_style = "hashtag"` in `config.toml` to render the
> in-body `**Topics:**` line as `#ai-coding #software-engineering` instead of
> wikilinks — useful if you navigate primarily via Obsidian's tag pane. The
> frontmatter `tags:` are native Obsidian tags in either mode.
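
As a rough illustration of the two styles, the rendering could look like the sketch below. `render_topics_line` is a hypothetical name, not the generator's actual function; the output formats follow the examples given for `topic_style`.

```python
def render_topics_line(slugs: list[str], topic_style: str = "wikilink") -> str:
    """Render the in-body Topics line in one of the two documented styles."""
    if topic_style == "wikilink":
        body = " · ".join(f"[[{slug}]]" for slug in slugs)
    elif topic_style == "hashtag":
        body = " ".join(f"#{slug}" for slug in slugs)
    else:
        raise ValueError(f"unsupported topic_style: {topic_style!r}")
    return f"**Topics:** {body}"


slugs = ["ai-coding", "software-engineering"]
print(render_topics_line(slugs))
# **Topics:** [[ai-coding]] · [[software-engineering]]
print(render_topics_line(slugs, "hashtag"))
# **Topics:** #ai-coding #software-engineering
```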

### Layer 2 — Topics

The layer that makes XBrain more than a tidy backup. **A topic page is not a
@@ -377,6 +382,7 @@ resynth_threshold = 25 # re-synthesise an overview after N ne

[output]
language = "English" # English | Spanish
topic_style = "wikilink" # wikilink | hashtag (in-body Topics: line)
```

| Section | Key | Default | Purpose |
@@ -390,6 +396,7 @@ language = "English" # English | Spanish
| `[vocab]` | `target_count` | `30` | Number of topics the `vocab` stage induces. |
| `[topics]` | `resynth_threshold` | `25` | Post growth that marks a topic overview stale. |
| `[output]` | `language` | `English` | Output language for LLM summaries/overviews AND wiki section headers. `English` or `Spanish`. |
| `[output]` | `topic_style` | `wikilink` | How the in-body `**Topics:**` line is rendered: `wikilink` (`[[slug]] · [[slug]]`) or `hashtag` (`#slug #slug`). Frontmatter `tags:` are unaffected. |

Switching `[output].language` after the corpus is already enriched is supported
— but does not retroactively translate existing summaries. To convert the
@@ -508,14 +515,15 @@ uv run xbrain <command> [options]
|---------|-------------|
| `extract` | Extract bookmarks and/or own tweets from X. `--source bookmarks\|tweets\|all`. |
| `import-archive <zip>` | Backfill the full own-tweet history from the official X data archive. |
-| `fetch` | Download linked article content, expand threads, fetch linked X content. `--force` re-fetches everything. |
+| `fetch` | Download linked article content, expand threads, fetch linked X content. By default, items whose only previous failures were transient (`timeout`, `dns_error`) are re-fetched automatically; terminal failures (`not_found`, `paywall`, `forbidden`, `js_required`, `empty_content`) stay skipped until `--force`. `--force` re-fetches every external_article source regardless of state. |
| `vocab` | Induce the topic taxonomy. `--executor`, `--apply <file>`, `--regenerate`. |
| `enrich` | Enrich items with a summary + topics. `--executor`, `--apply <file>`. |
| `topics` | Synthesise topic pages. `--executor`, `--apply <file>`, `--resynth`. |
| `generate` | Render the wiki into the vault. |
| `sync` | `extract` + `fetch` + `generate`, in order. |
| `status` | Counts and last-run timestamps. |
| `snapshot` | Manage `data/` snapshots: `create`, `list`, `show`, `restore`, `prune`. See [Snapshots & safety](#snapshots--safety). |
| `diff` | Compare two snapshots (or one snapshot vs. the live `data/`). Surfaces reassigned items, topic growth, overview drift, vocab changes. `--format text\|json`. |
| `login` | Open a browser to log in to X (see [Authentication](#authentication) — prefer the cookie import). |

Every stage accepts `--since` / `--until` (ISO dates) to narrow the date window.
@@ -544,6 +552,19 @@ The Obsidian vault is **not** snapshotted — it is fully derived from `data/`
via `xbrain generate`. `restore` rolls back `data/`; you run `xbrain generate`
to rebuild the wiki from it.

After a destructive run, `xbrain diff <pre-snapshot>` shows exactly what
moved — items whose `primary_topic` was reassigned, topic memberships that
grew or shrank, overview text that drifted, and vocab slugs added or
removed. The B side defaults to the live `data/`, so the common case is one
short command. Add `--format json` to pipe the report into `jq` or a CI
gate.

```bash
xbrain diff 2026-05-22T18-30-15Z-pre-vocab-regenerate # vs. live data/
xbrain diff <snap-a> <snap-b> # two named snapshots
xbrain diff <snap-a> --format json | jq '.summary.reassigned_pct'
```
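
For a CI gate that needs more than a `jq` one-liner, the JSON report can be checked from Python. This is a sketch under assumptions: the only field relied on is `summary.reassigned_pct`, taken from the `jq` example above; the rest of the report schema, the unit of that field, and the `within_budget` helper are hypothetical.

```python
import json


def within_budget(report_json: str, max_reassigned_pct: float) -> bool:
    """Return True when the reported reassignment share is within the budget.

    Assumes the report exposes `summary.reassigned_pct` as a number, as the
    jq example suggests. Whether it is a 0-100 percentage or a 0-1 fraction
    is not stated in the README, so pick the threshold to match.
    """
    report = json.loads(report_json)
    return report["summary"]["reassigned_pct"] <= max_reassigned_pct


# Hypothetical report fragment, for illustration only.
sample = '{"summary": {"reassigned_pct": 3.5}}'
print(within_budget(sample, max_reassigned_pct=10.0))  # True
```

In a pipeline, the output of `xbrain diff <snap-a> --format json` would be read from standard input and the job failed when the function returns `False`.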

---

## Execution modes
5 changes: 5 additions & 0 deletions config.toml.example
@@ -34,3 +34,8 @@ resynth_threshold = 25
# "English" - default
# "Spanish"
language = "English"
# How the in-body `**Topics:**` line is rendered on each item note.
# Frontmatter `tags:` are unaffected by this setting. Supported values:
# "wikilink" - **Topics:** [[ai-coding]] · [[software-engineering]] (default)
# "hashtag" - **Topics:** #ai-coding #software-engineering
topic_style = "wikilink"
11 changes: 11 additions & 0 deletions pyproject.toml
@@ -51,6 +51,17 @@ target-version = "py312"
python_version = "3.12"
ignore_missing_imports = true
files = ["src/xbrain"]
# Pydantic mypy plugin turns model __init__ into a typed signature so
# mypy can flag missing required fields (e.g. ContentSourceSuccess
# without `text`). The #20 refactor relies on this to make illegal
# states unrepresentable at the type level — verified by the
# tests/type_probes/illegal_states.py probe.
plugins = ["pydantic.mypy"]

[tool.pydantic-mypy]
init_forbid_extra = true
init_typed = true
warn_required_dynamic_aliases = true

[tool.interrogate]
ignore-init-method = true
54 changes: 53 additions & 1 deletion src/xbrain/cli.py
@@ -14,6 +14,7 @@
from xbrain import snapshot
from xbrain.archive import parse_archive
from xbrain.config import Config, load_config
from xbrain.diff import diff_snapshots, format_json, format_text
from xbrain.enrich import apply_worksheet_judgments, enrich_with_executor, items_pending_enrichment
from xbrain.executors.api import ApiExecutor
from xbrain.extract.browser import login as run_login
@@ -191,7 +192,7 @@ def _run_fetch(cfg: Config, since: datetime | None, until: datetime | None, forc

def _run_generate(cfg: Config, since: datetime | None, until: datetime | None) -> None:
    store = load_store(cfg.items_path)
-    run_generate(store, cfg.output_dir, since, until, cfg.output_language)
+    run_generate(store, cfg.output_dir, since, until, cfg.output_language, cfg.topic_style)
    typer.echo(f"Markdown generado en {cfg.output_dir}")


@@ -549,5 +550,56 @@ def snapshot_prune_cmd(
    typer.echo(f"Snapshots deleted: {deleted}")


def _resolve_data_dir(cfg: Config, name: str | None) -> Path:
    """Resolve a snapshot name to its data dir, or `None` to the live `data/`.

    `xbrain diff` accepts a snapshot name (resolved via `snapshot_show`) OR
    `None` to mean "the current live `data/`" — the most common B-side of the
    comparison the user runs after a destructive op.
    """
    if name is None:
        return cfg.data_dir
    snapshot_dir, _ = snapshot.snapshot_show(cfg.data_dir, name)
    return snapshot_dir


@app.command()
@_handle_cli_errors
def diff(
    snapshot_a: str = typer.Argument(..., help="Snapshot name on the A side."),
    snapshot_b: str | None = typer.Argument(
        None,
        help="Snapshot name on the B side. Defaults to the live data/ directory.",
    ),
    output_format: str = typer.Option(
        "text",
        "--format",
        help="Output format: 'text' (default) or 'json'.",
    ),
) -> None:
    """Compare two snapshots and surface drift.

    Reports reassigned items, topic-membership shifts, topic-overview drift
    (TF cosine similarity) and vocab changes. The B side defaults to the live
    `data/` directory so `xbrain diff <pre-snapshot>` answers "what did the
    last destructive op move?" with no extra arguments.
    """
    cfg = _config()
    if output_format not in ("text", "json"):
        raise ValueError(f"--format must be 'text' or 'json', got {output_format!r}")
    a_dir = _resolve_data_dir(cfg, snapshot_a)
    b_dir = _resolve_data_dir(cfg, snapshot_b)
    report = diff_snapshots(a_dir, b_dir)
    if output_format == "json":
        typer.echo(format_json(report))
    else:
        b_label = snapshot_b if snapshot_b is not None else "live data/"
        typer.echo("Comparing:")
        typer.echo(f"  A: {snapshot_a}")
        typer.echo(f"  B: {b_label}")
        typer.echo("")
        typer.echo(format_text(report))


if __name__ == "__main__":
app()
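
The `diff` docstring mentions TF cosine similarity as the measure of overview drift, but `src/xbrain/diff.py` itself is not part of the hunks shown here. The following is therefore only a minimal sketch of the general technique, not the project's implementation: `tf_cosine`, the `\w+` tokenisation, and the handling of empty texts are all assumptions.

```python
import math
import re
from collections import Counter


def tf_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between raw term-frequency vectors of two texts."""
    tf_a = Counter(re.findall(r"\w+", text_a.lower()))
    tf_b = Counter(re.findall(r"\w+", text_b.lower()))
    if not tf_a and not tf_b:
        return 1.0  # assumption: two empty texts count as identical
    if not tf_a or not tf_b:
        return 0.0  # assumption: empty vs. non-empty counts as fully drifted
    # Dot product over the shared vocabulary only; other terms contribute 0.
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    return dot / (norm_a * norm_b)


old = "Agents write code; humans review it."
new = "Agents write most code; humans mostly review."
print(round(tf_cosine(old, new), 3))
```

A value near 1.0 means the overview kept roughly the same vocabulary; a value near 0.0 means it was largely rewritten. Raw term frequency ignores word order and synonyms, so it measures lexical drift rather than semantic drift.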
16 changes: 15 additions & 1 deletion src/xbrain/config.py
@@ -10,6 +10,11 @@
from xbrain.i18n import strings_for
from xbrain.models import ExecutorName

# In-body `**Topics:**` line styles. `wikilink` (default) keeps the current
# navigation-first behaviour; `hashtag` emits Obsidian tags so the line pivots
# into the tag pane. Frontmatter `tags:` are unaffected by this toggle.
SUPPORTED_TOPIC_STYLES: tuple[str, ...] = ("wikilink", "hashtag")


@dataclass(frozen=True)
class Config:
@@ -23,6 +28,7 @@ class Config:
    vocab_target_count: int
    topics_resynth_threshold: int
    output_language: str  # one of xbrain.i18n.SUPPORTED_LANGUAGES
    topic_style: str  # one of xbrain.config.SUPPORTED_TOPIC_STYLES

    @property
    def items_path(self) -> Path:
@@ -64,10 +70,17 @@ def load_config(repo_root: Path) -> Config:
    resynth_threshold = int(topics.get("resynth_threshold", 25))
    if resynth_threshold < 1:
        raise ValueError("config.toml: [topics].resynth_threshold must be >= 1")
-    output_language = settings.get("output", {}).get("language", "English")
+    output = settings.get("output", {})
+    output_language = output.get("language", "English")
    # Validate via strings_for: it already raises ValueError listing supported
    # languages on an unknown value. Single source of truth for the check.
    strings_for(output_language)
    topic_style = output.get("topic_style", "wikilink")
    if topic_style not in SUPPORTED_TOPIC_STYLES:
        raise ValueError(
            f"config.toml: [output].topic_style must be one of "
            f"{list(SUPPORTED_TOPIC_STYLES)}, got {topic_style!r}"
        )
    return Config(
        repo_root=repo_root,
        vault=vault,
@@ -79,4 +92,5 @@ def load_config(repo_root: Path) -> Config:
        vocab_target_count=target_count,
        topics_resynth_threshold=resynth_threshold,
        output_language=output_language,
        topic_style=topic_style,
    )