C — Claude Code sessions corpus: from_claude_sessions + ClaudeTurn strategy

## Summary

Add a corpus over the user's Claude Code CLI session transcripts, so an agent can
ask "when did I work on X / what did Claude conclude about Y" across past sessions.
This is a new source preset + indexing strategy under the roadmap (#1), reusing the
existing multi-surface model so the *same* corpus answers both "what I asked" and
"what the assistant answered".

## Source format (verified against ~1,614 real transcripts, ~1.1 GB)

- Transcripts are JSONL: `~/.claude/projects/<encoded-cwd>/<session-uuid>.jsonl`,
  one file per session.
- Line `type`s include `user`, `assistant`, plus non-conversational records
  (`ai-title`, `custom-title`, `system`, `file-history-snapshot`, ...).
- **No conversation summary/recap is persisted** — there is no `type: "summary"` and
  no compaction summary in the transcript. The only persisted recap is `ai-title`
  (a short AI-generated session title) and an optional `custom-title`. So segmentation
  cannot lean on stored recaps.
- An assistant "turn" is **fragmented across many lines** (one content block per line:
  `thinking` / `text` / `tool_use`, interleaved with `tool_result` user lines). The
  **final `text` block before the next real user message is the highest-signal content**
  ("here's what I did") — confirming the deliberation-is-noise hypothesis. Turn end is
  detectable via `stop_reason != "tool_use"` and/or the next non-`isMeta` `user` line.
- Useful filter fields per turn: `sessionId`, `cwd`/project, `gitBranch`, `timestamp`,
  `model`, `has_tool_use` (+ token `usage` when present).

## Design

**Parser lives in `priv`** (it owns `~/.claude` tooling; mirrors how
`CorpusSource.from_skills` already consumes `priv.skills_index`): add
`priv.claude_transcripts` yielding turn-pair records
`{id, user_prompt, assistant_summary, assistant_full, metadata}` with an injectable
fetcher for tests.

**`ir` side:**
- `CorpusSource.from_claude_sessions(*, since=None, projects=None, include_full=False,
  max_sessions=None, fetcher=None)` — lazy scope over turn pairs; `change_signal` =
  per-turn content hash (so a growing session re-indexes only its new turns,
  incrementally).
- `ClaudeTurn` indexing strategy → two surfaces by default:
  - `user_prompt` (what the human asked)
  - `assistant_summary` (final text block of the turn — the signal)
  - optional `assistant_full` surface behind `include_full=True` (off by default to
    avoid deliberation noise; opt-in for recall).
  `surfaces={"user_prompt"}` / `{"assistant_summary"}` then targets either side at query
  time — directly answering "sometimes I want what I asked, sometimes what was answered".
- Register a `sessions` preset (auto-registers with smart defaults like the others).

**AI synopses: not systematic.** Per-turn LLM synopsis would be far too expensive at
~0.5–1M turn pairs. `ir.with_synopsis` stays available as an opt-in wrapper, and the
per-corpus background-work policy in #D decides if/when synopses run (e.g. only on
recent sessions, during downtime). Default: none.

**Scale / selectivity.** ~0.5–1M candidate turn pairs corpus-wide. Smart defaults must
scope: default to a recent window (`since`) and the two lean surfaces; `projects=` and
`max_sessions=` narrow further. This is the first real customer of the per-corpus
parametrization in #D (segmentation/model/schedule/storage).

## Open questions to confirm during implementation
- Default `since` window (all-history first build is heavy) — propose last 90 days.
- Whether to also index `ai-title` as a cheap session-level surface (one per session).

Related: #1 (roadmap), depends on #D (per-corpus params), #56 (perf — a 0.5M+ corpus
makes the packed-matrix work load-bearing).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

C — Claude Code sessions corpus: from_claude_sessions + ClaudeTurn strategy #57

Summary

Source format (verified against ~1,614 real transcripts, ~1.1 GB)

Design

Open questions to confirm during implementation

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

C — Claude Code sessions corpus: from_claude_sessions + ClaudeTurn strategy #57

Description

Summary

Source format (verified against ~1,614 real transcripts, ~1.1 GB)

Design

Open questions to confirm during implementation

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions