Summary
There is no single place to parametrize a corpus's background work and its
build/storage/segmentation policy, nor smart defaults for new corpora. Today the
registry stores only {kind, embedder, params} and source_from_entry() cannot even
persist a strategy — it always reconstructs the preset default, so per-corpus
chunk_size, segmentation choices, storage backend, and any maintenance cadence are
unrepresentable. This issue adds a declarative per-corpus policy layer (registry v2) +
a defaults-per-kind rule set + an idempotent maintenance command.
Boundary respected: ir holds the declarative policy as data and exposes an idempotent
ir maintain; it does not run a scheduler. Execution (cron/launchd, or raglab's
budget governor / run-log per ADR #43) stays external. No Settings singleton — policy
is per-corpus data injected at build/maintain time (DI principle, semantic_search_design_notes §4.3).
ir must not import raglab (one-way dep).
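
The boundary above can be illustrated with a small sketch: a pure "what is due?" check that receives the policy and the clock as arguments, so there is no global settings object and any external scheduler can call it repeatedly. The class names, `due_work`, `in_window`, and the `source_changed` / `synopsis_stale` inputs are hypothetical and not part of ir today. The field names follow the proposed schema, and `every` is assumed to be already parsed from a string like "24h" into a `timedelta`.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class ReindexPolicy:
    on: str = "source-change"          # "source-change" | "interval"
    every: Optional[timedelta] = None  # only consulted when on == "interval"


@dataclass(frozen=True)
class SynopsisPolicy:
    enabled: bool = False
    window_hours: Sequence[str] = ()   # e.g. ("02:00-06:00",)


def in_window(now: datetime, windows: Sequence[str]) -> bool:
    """True if now's wall-clock time falls inside any "HH:MM-HH:MM" window."""
    t = now.time()
    for window in windows:
        start_s, end_s = window.split("-")
        start, end = time.fromisoformat(start_s), time.fromisoformat(end_s)
        if start <= end:
            if start <= t < end:
                return True
        elif t >= start or t < end:    # window wraps past midnight
            return True
    return False


def due_work(
    reindex: ReindexPolicy,
    synopsis: SynopsisPolicy,
    *,
    now: datetime,
    last_reindex: Optional[datetime],
    source_changed: bool,
    synopsis_stale: bool,
) -> List[str]:
    """Return the tasks that are due; an empty list means the call is a no-op."""
    due: List[str] = []
    if reindex.on == "source-change":
        if source_changed:
            due.append("reindex")
    elif reindex.on == "interval" and reindex.every is not None:
        if last_reindex is None or now - last_reindex >= reindex.every:
            due.append("reindex")
    # Synopsis refresh only runs when enabled, stale, and inside the window.
    if synopsis.enabled and synopsis_stale and in_window(now, synopsis.window_hours):
        due.append("synopsis")
    return due
```

Because the function only reads its arguments, calling it twice with the same state yields the same answer, which is the property that lets cron or launchd invoke the command as often as it likes.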
Gap
- ir/registry.py: entry = {kind, embedder, params}; source_from_entry drops strategy.
- No per-corpus: segmentation/strategy spec, storage backend choice, reindex condition,
synopsis policy, or "downtime" schedule.
- No documented defaults-per-kind (the three presets hardcode their own defaults).
Plan
- Registry v2 schema (back-compatible). Extend the entry to optionally carry:

    {
      kind, embedder, params,
      strategy:    {name, params}      # persist & reconstruct the IndexingStrategy
      storage:     {backend, params}   # default "local" file store; seam for #28 / packed-matrix
      maintenance: { reindex:  {on: "source-change"|"interval", every: "24h"|null},
                     synopsis: {enabled: false, scope: "recent"|..., window: "30d",
                                window_hours: ["02:00-06:00"]} }
    }

v1 entries (no strategy/storage/maintenance) keep working unchanged.
source_from_entry learns to reconstruct a persisted strategy.
- Defaults-per-kind registry. A small pluggable table
kind -> recommended {strategy, embedder, storage, maintenance} so registering a new corpus (or a new
kind) starts from sensible, documented defaults instead of scattered constructor
defaults. New kinds register their defaults here.
- ir maintain [name|--all] reads each corpus's maintenance policy and does the
due work idempotently: incremental rebuild when the reindex condition is met, and
(only if synopsis.enabled and within the configured window) refresh synopses on the
in-scope slice. Safe to call from cron every N minutes; it no-ops when nothing is due.
Ships with a documented cron/launchd snippet (the "downtime hours" live in the policy;
the scheduler just calls ir maintain often).
- CLI/register surface to set these without hand-editing JSON, plus ir info
showing the resolved policy.
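
As a rough illustration of how the v2 schema and the defaults-per-kind table could fit together, the sketch below resolves a registry entry by layering base defaults, per-kind defaults, and the entry's own fields. Everything named here (`BASE_DEFAULTS`, `DEFAULTS_BY_KIND`, `register_kind_defaults`, `resolve_policy`) is hypothetical. The "local" storage default and `synopsis.enabled: false` come from the schema above, while the default reindex condition is an assumption.

```python
import copy
from typing import Any, Dict

Policy = Dict[str, Any]

# Assumed base defaults applied to every kind (values are illustrative).
BASE_DEFAULTS: Policy = {
    "storage": {"backend": "local", "params": {}},
    "maintenance": {
        "reindex": {"on": "source-change", "every": None},  # assumption
        "synopsis": {"enabled": False},
    },
}

# Hypothetical pluggable table: kind -> recommended policy fragment.
DEFAULTS_BY_KIND: Dict[str, Policy] = {}


def register_kind_defaults(kind: str, defaults: Policy) -> None:
    """Register (or replace) the recommended defaults for a corpus kind."""
    DEFAULTS_BY_KIND[kind] = copy.deepcopy(defaults)


def _deep_merge(base: Policy, override: Policy) -> Policy:
    """Return a new dict where override wins, merging nested dicts key by key."""
    out = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = _deep_merge(out[key], value)
        else:
            out[key] = copy.deepcopy(value)
    return out


def resolve_policy(entry: Policy) -> Policy:
    """Fill in defaults for a v1 or v2 entry without mutating the entry.

    A v1 entry (only kind/embedder/params) resolves to pure defaults; if no
    strategy is registered for its kind, the "strategy" key stays absent and
    the caller can fall back to the preset default as before.
    """
    kind_defaults = DEFAULTS_BY_KIND.get(entry["kind"], {})
    return _deep_merge(_deep_merge(BASE_DEFAULTS, kind_defaults), entry)
```

A resolved view like this is also what a command such as ir info could print, since it shows the effective policy rather than only the fields stored in the registry.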
Why this shape
- [...] ir build skills): all of the above is optional with smart defaults.
- [...] while keeping the runner external/idempotent so it composes with cron now and with
raglab's observability/budget layer later (ADR #43) without ir importing raglab.

Related: #1 (roadmap), #38 (epic: raglab consumes ir's registry view), ADR #43
(scheduling/run-log boundary), #28 (storage backend seam), #C (first consumer:
sessions corpus segmentation + recent-only synopsis-in-downtime).