D — per-corpus background-work & defaults parametrization (registry v2 + ir maintain) #58

Description

@thorwhalen

Summary

There is no single place to parametrize a corpus's background work and its
build/storage/segmentation policy, nor smart defaults for new corpora. Today the
registry stores only {kind, embedder, params} and source_from_entry() cannot even
persist a strategy: it always reconstructs the preset default, so per-corpus
chunk_size, segmentation choices, storage backend, and any maintenance cadence are
unrepresentable. This issue adds a declarative per-corpus policy layer (registry v2) +
a defaults-per-kind rule set + an idempotent maintenance command.

Boundary respected: ir holds the declarative policy as data and exposes an idempotent
ir maintain; it does not run a scheduler. Execution (cron/launchd, or raglab's
budget governor / run-log per ADR #43) stays external. No Settings singleton: policy
is per-corpus data injected at build/maintain time (DI principle,
semantic_search_design_notes §4.3). ir must not import raglab (one-way dep).
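
A minimal sketch of what "policy as injected data, no scheduler" could look like. Everything here is an assumption for illustration: due_actions, parse_duration, in_window, and the state keys (last_indexed, source_fingerprint, indexed_fingerprint) are hypothetical names, not existing ir APIs. The policy dict follows the maintenance shape proposed in the Plan. The function is pure: the caller passes the policy, the corpus state, and the current time, so nothing reads a global Settings object or a clock.

```python
from datetime import datetime, time, timedelta


def parse_duration(text):
    """Parse '30m', '24h' or '30d' into a timedelta (assumed duration format)."""
    units = {"m": "minutes", "h": "hours", "d": "days"}
    return timedelta(**{units[text[-1]]: int(text[:-1])})


def in_window(now, windows):
    """True if now's time of day falls inside any 'HH:MM-HH:MM' window.

    Windows may wrap past midnight (e.g. '22:00-02:00').
    Assumption: an empty window list means "never", the conservative choice.
    """
    t = now.time()
    for window in windows:
        start_s, end_s = window.split("-")
        start, end = time.fromisoformat(start_s), time.fromisoformat(end_s)
        if start <= end:
            if start <= t < end:
                return True
        elif t >= start or t < end:
            return True
    return False


def due_actions(policy, state, now):
    """Return the maintenance actions that are due; an empty list means no-op."""
    actions = []
    reindex = policy.get("reindex", {})
    if reindex.get("on") == "source-change":
        # Hypothetical fingerprints: reindex only when the source has moved on.
        if state.get("source_fingerprint") != state.get("indexed_fingerprint"):
            actions.append("reindex")
    elif reindex.get("on") == "interval" and reindex.get("every"):
        last = state.get("last_indexed")
        if last is None or now - last >= parse_duration(reindex["every"]):
            actions.append("reindex")
    synopsis = policy.get("synopsis", {})
    if synopsis.get("enabled") and in_window(now, synopsis.get("window_hours", [])):
        actions.append("synopsis")
    return actions
```

Because the decision is a function of (policy, state, now), an external scheduler can call it as often as it likes. A real implementation would also need to record completed work in the state so that repeated calls inside the same window stay no-ops.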

Gap

  • ir/registry.py: entry = {kind, embedder, params}; source_from_entry drops strategy.
  • No per-corpus: segmentation/strategy spec, storage backend choice, reindex condition,
    synopsis policy, or "downtime" schedule.
  • No documented defaults-per-kind (the three presets hardcode their own defaults).

Plan

  1. Registry v2 schema (back-compatible). Extend the entry to optionally carry:
    {
      kind, embedder, params,
      strategy:   {name, params}        # persist & reconstruct the IndexingStrategy
      storage:    {backend, params}     # default "local" file store; seam for #28 / packed-matrix
      maintenance:{ reindex: {on: "source-change"|"interval", every: "24h"|null},
                    synopsis:{ enabled: false, scope: "recent"|..., window: "30d",
                               window_hours: ["02:00-06:00"] } }
    }
    
    v1 entries (no strategy/storage/maintenance) keep working unchanged.
    source_from_entry learns to reconstruct a persisted strategy.
  2. Defaults-per-kind registry. A small pluggable table
    kind -> recommended {strategy, embedder, storage, maintenance}, so registering a new
    corpus (or a new kind) starts from sensible, documented defaults instead of scattered
    constructor defaults. New kinds register their defaults here.
  3. ir maintain [name|--all] — reads each corpus's maintenance policy and does the
    due work idempotently: incremental rebuild when the reindex condition is met, and
    (only if synopsis.enabled and within the configured window) refresh synopses on the
    in-scope slice. Safe to call from cron every N minutes; it no-ops when nothing is due.
    Ships with a documented cron/launchd snippet (the "downtime hours" live in the policy;
    the scheduler just calls ir maintain often).
  4. CLI/register surface to set these without hand-editing JSON, plus ir info
    showing the resolved policy.
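
To make items 1, 2 and 4 concrete, here is a hedged sketch of how a resolved policy might be computed from a registry entry plus per-kind defaults. All names are hypothetical (DEFAULTS_BY_KIND, register_kind_defaults, resolve_entry, the "example-kind" kind, the "fixed-size" strategy and its chunk_size value); none of them are existing ir APIs. One assumed design choice: strategy and storage are replaced wholesale when an entry sets them (so params from a different strategy never leak in), while maintenance is merged key by key.

```python
import copy

# Hypothetical pluggable table: kind -> recommended defaults.
DEFAULTS_BY_KIND = {}


def register_kind_defaults(kind, defaults):
    """Register (or overwrite) the documented defaults for a corpus kind."""
    DEFAULTS_BY_KIND[kind] = copy.deepcopy(defaults)


def deep_merge(base, override):
    """Return a new dict: override wins, nested dicts are merged recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_entry(entry):
    """Overlay a v1 or v2 registry entry on its kind's defaults.

    A v1 entry (only kind, embedder, params) picks up every default, so the
    defaults table has to mirror today's preset behaviour for v1 corpora to
    keep working unchanged.
    """
    resolved = copy.deepcopy(DEFAULTS_BY_KIND.get(entry["kind"], {}))
    for key, value in entry.items():
        if key == "maintenance":
            resolved[key] = deep_merge(resolved.get(key, {}), value)
        else:
            # kind, embedder, params, strategy, storage: the entry wins outright.
            resolved[key] = copy.deepcopy(value)
    return resolved


# Illustrative values only; real defaults would come from the existing presets.
register_kind_defaults("example-kind", {
    "strategy": {"name": "fixed-size", "params": {"chunk_size": 512}},
    "storage": {"backend": "local", "params": {}},
    "maintenance": {
        "reindex": {"on": "source-change", "every": None},
        "synopsis": {"enabled": False},
    },
})

v1_entry = {"kind": "example-kind", "embedder": "some-embedder", "params": {}}
resolved_policy = resolve_entry(v1_entry)
```

The resolved dict is the kind of view that ir info could print, and the same function could back the register CLI so that users only state what differs from the defaults.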

Why this shape

Related: #1 (roadmap), #38 (epic — raglab consumes ir's registry view), ADR #43
(scheduling/run-log boundary), #28 (storage backend seam), #C (first consumer:
sessions corpus segmentation + recent-only synopsis-in-downtime).
