C — Claude Code sessions corpus: from_claude_sessions + ClaudeTurn strategy #57

Description

@thorwhalen

Summary

Add a corpus over the user's Claude Code CLI session transcripts, so an agent can
ask "when did I work on X / what did Claude conclude about Y" across past sessions.
This is a new source preset + indexing strategy under the roadmap (#1), reusing the
existing multi-surface model so the same corpus answers both "what I asked" and
"what the assistant answered".

Source format (verified against ~1,614 real transcripts, ~1.1 GB)

  • Transcripts are JSONL: ~/.claude/projects/<encoded-cwd>/<session-uuid>.jsonl,
    one file per session.
  • Line types include user, assistant, plus non-conversational records
    (ai-title, custom-title, system, file-history-snapshot, ...).
  • No conversation summary/recap is persisted — there is no type: "summary" and
    no compaction summary in the transcript. The only persisted recap is ai-title
    (a short AI-generated session title) and an optional custom-title. So segmentation
    cannot lean on stored recaps.
  • An assistant "turn" is fragmented across many lines (one content block per line:
    thinking / text / tool_use, interleaved with tool_result user lines). The
    final text block before the next real user message is the highest-signal content
    ("here's what I did") — confirming the deliberation-is-noise hypothesis. Turn end is
    detectable via stop_reason != "tool_use" and/or the next non-isMeta user line.
  • Useful filter fields per turn: sessionId, cwd/project, gitBranch, timestamp,
    model, has_tool_use (+ token usage when present).
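To make the segmentation rule concrete, here is a minimal sketch of grouping transcript lines into turn pairs. The record types, the isMeta flag and the thinking / text / tool_use / tool_result block names come from the notes above. The exact JSON layout (message.content as either a string or a list of typed blocks) and the function names are assumptions for illustration. It uses the "next real user line" boundary rather than stop_reason.

```python
import json
from typing import Iterable, Iterator


def text_blocks(record: dict) -> list[str]:
    """Collect non-empty text from a record's message content (assumed layout)."""
    content = record.get("message", {}).get("content", "")
    if isinstance(content, str):
        return [content] if content.strip() else []
    return [
        block["text"]
        for block in content
        if isinstance(block, dict)
        and block.get("type") == "text"
        and block.get("text", "").strip()
    ]


def is_real_user(record: dict) -> bool:
    """True for a human prompt: type 'user', not isMeta, not a tool_result carrier."""
    if record.get("type") != "user" or record.get("isMeta"):
        return False
    content = record.get("message", {}).get("content", "")
    if isinstance(content, list):
        return not any(
            isinstance(block, dict) and block.get("type") == "tool_result"
            for block in content
        )
    return isinstance(content, str) and bool(content.strip())


def iter_turn_pairs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield {user_prompt, assistant_summary} per completed turn.

    assistant_summary is the last text block seen before the next real
    user message; thinking and tool_use blocks are skipped.
    """
    prompt = None
    assistant_texts: list[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if is_real_user(record):
            # A new human prompt closes the previous turn, if it got an answer.
            if prompt is not None and assistant_texts:
                yield {"user_prompt": prompt, "assistant_summary": assistant_texts[-1]}
            prompt = "\n".join(text_blocks(record))
            assistant_texts = []
        elif record.get("type") == "assistant" and prompt is not None:
            assistant_texts.extend(text_blocks(record))
        # Everything else (ai-title, system, tool_result lines, ...) is ignored.
    if prompt is not None and assistant_texts:
        yield {"user_prompt": prompt, "assistant_summary": assistant_texts[-1]}
```

A prompt with no assistant text after it (for example the last line of a session still in progress) yields nothing in this sketch; whether such dangling prompts should be indexed is a separate choice.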

Design

The parser lives in priv, which owns the ~/.claude tooling; this mirrors how
CorpusSource.from_skills already consumes priv.skills_index. Add
priv.claude_transcripts, yielding turn-pair records
{id, user_prompt, assistant_summary, assistant_full, metadata} with an injectable
fetcher for tests.
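One possible shape for that record and the injectable fetcher, as a sketch only. The five record fields are the ones listed above. The TurnPair class, the Fetcher contract (a callable yielding a session id plus already-decoded records), the "session:index" id scheme and the metadata keys copied from the user record are all assumptions. Segmentation is deliberately simplified to string-valued user prompts.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class TurnPair:
    id: str
    user_prompt: str
    assistant_summary: str  # final text block of the turn
    assistant_full: str     # all text blocks of the turn, joined
    metadata: dict


# Hypothetical contract: a zero-argument callable yielding
# (session_id, decoded JSONL records) for each session.
Fetcher = Callable[[], Iterable[tuple[str, Iterable[dict]]]]


def _make_pair(session_id: str, index: int, prompt: str,
               texts: list[str], meta: dict) -> TurnPair:
    return TurnPair(
        id=f"{session_id}:{index}",  # assumed id scheme
        user_prompt=prompt,
        assistant_summary=texts[-1],
        assistant_full="\n\n".join(texts),
        metadata={"sessionId": session_id, **meta},
    )


def claude_transcripts(fetcher: Fetcher) -> Iterator[TurnPair]:
    """Yield turn pairs from whatever the fetcher provides (disk or a test fake)."""
    for session_id, records in fetcher():
        prompt, texts, meta, index = None, [], {}, 0
        for rec in records:
            content = rec.get("message", {}).get("content")
            is_prompt = (
                rec.get("type") == "user"
                and isinstance(content, str)
                and not rec.get("isMeta")
            )
            if is_prompt:
                if prompt is not None and texts:
                    yield _make_pair(session_id, index, prompt, texts, meta)
                    index += 1
                prompt, texts = content, []
                meta = {k: rec[k] for k in ("cwd", "gitBranch", "timestamp") if k in rec}
            elif rec.get("type") == "assistant" and prompt is not None and isinstance(content, list):
                texts += [
                    b["text"] for b in content
                    if isinstance(b, dict) and b.get("type") == "text" and b.get("text")
                ]
        if prompt is not None and texts:
            yield _make_pair(session_id, index, prompt, texts, meta)
```

In tests the fetcher can be a plain generator function returning in-memory records, so nothing touches ~/.claude.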

ir side:

  • CorpusSource.from_claude_sessions(*, since=None, projects=None, include_full=False,
    max_sessions=None, fetcher=None): a lazy scope over turn pairs. change_signal is a
    per-turn content hash, so a growing session re-indexes only its new turns.
  • ClaudeTurn indexing strategy → two surfaces by default:
    • user_prompt (what the human asked)
    • assistant_summary (final text block of the turn — the signal)
    • optional assistant_full surface behind include_full=True (off by default to
      avoid deliberation noise; opt-in for recall).
    Passing surfaces={"user_prompt"} or surfaces={"assistant_summary"} at query time
    then targets either side, directly answering "sometimes I want what I asked,
    sometimes what was answered".
  • Register a sessions preset (auto-registers with smart defaults like the others).
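A rough sketch of the per-turn change signal and the surface split described in this list. The function names, the choice of SHA-256 over a canonical JSON payload, and the seen mapping are assumptions; how ir actually stores and compares change signals is not specified here.

```python
import hashlib
import json


def turn_change_signal(turn: dict) -> str:
    """Content hash of one turn pair; stable across runs for unchanged content."""
    payload = json.dumps(
        {
            "user_prompt": turn["user_prompt"],
            "assistant_summary": turn["assistant_summary"],
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def turns_to_reindex(turns: list[dict], seen: dict[str, str]) -> list[dict]:
    """Return turns whose hash is new or changed, updating `seen` in place."""
    stale = []
    for turn in turns:
        signal = turn_change_signal(turn)
        if seen.get(turn["id"]) != signal:
            stale.append(turn)
            seen[turn["id"]] = signal
    return stale


def surfaces_for(turn: dict, include_full: bool = False) -> dict[str, str]:
    """Map surface name to the text indexed for it; assistant_full is opt-in."""
    surfaces = {
        "user_prompt": turn["user_prompt"],
        "assistant_summary": turn["assistant_summary"],
    }
    if include_full:
        surfaces["assistant_full"] = turn.get("assistant_full", "")
    return surfaces
```

Because the signal is computed per turn rather than per file, appending turns to a live session leaves the hashes of earlier turns untouched, so only the new turns come back from turns_to_reindex.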

AI synopses: not systematic. Per-turn LLM synopsis would be far too expensive at
~0.5–1M turn pairs. ir.with_synopsis stays available as an opt-in wrapper, and the
per-corpus background-work policy in #D decides if/when synopses run (e.g. only on
recent sessions, during downtime). Default: none.

Scale / selectivity. ~0.5–1M candidate turn pairs corpus-wide. Smart defaults must
scope: default to a recent window (since) and the two lean surfaces; projects= and
max_sessions= narrow further. This is the first real customer of the per-corpus
parametrization in #D (segmentation/model/schedule/storage).
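The scoping rules could look roughly like this. SessionInfo, scope_sessions and the use of a last-modified timestamp are hypothetical; the 90-day fallback is the window proposed under the open questions, not a settled default. The sketch is eager for brevity, whereas the real scope is meant to be lazy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class SessionInfo:
    project: str
    session_id: str
    last_modified: datetime


DEFAULT_WINDOW = timedelta(days=90)  # proposed default, still an open question


def scope_sessions(
    sessions: Iterable[SessionInfo],
    *,
    since: Optional[datetime] = None,
    projects: Optional[Iterable[str]] = None,
    max_sessions: Optional[int] = None,
    now: Optional[datetime] = None,
) -> list[SessionInfo]:
    """Keep recent sessions, optionally restricted by project and capped in count."""
    now = now or datetime.now(timezone.utc)
    cutoff = since if since is not None else now - DEFAULT_WINDOW
    wanted = set(projects) if projects is not None else None
    kept = [
        s for s in sessions
        if s.last_modified >= cutoff and (wanted is None or s.project in wanted)
    ]
    # Newest first, so max_sessions keeps the most recent ones.
    kept.sort(key=lambda s: s.last_modified, reverse=True)
    return kept if max_sessions is None else kept[:max_sessions]
```

Filtering at the session level, before any transcript is parsed, keeps the cost of a default build proportional to the recent window rather than to the full history.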

Open questions to confirm during implementation

  • Default since window (all-history first build is heavy) — propose last 90 days.
  • Whether to also index ai-title as a cheap session-level surface (one per session).

Related: #1 (roadmap), depends on #D (per-corpus params), #56 (perf — a 0.5M+ corpus
makes the packed-matrix work load-bearing).
