Evaluate unifying ir storage behind vd.Collection (from ir_04) — DEFERRED #28

Description

@thorwhalen

Refs #1, #12. Records the assessment of the ir_04 idea "unify ir storage behind vd.Collection to reuse vd.hybrid_search wholesale." Decision: defer (not skip). Verified against the actual vd and ir source.

Why defer — the headline payoff does not exist

The stated motivation was partly to fix the RRF score-magnitude loss ir_07 flagged as weakening hybrid abstention. But vd.hybrid_search's fallback path (vd/search.py _rrf_fuse) uses the same rank-based formula 1/(k+rank) as ir's own _rrf_fuse (ir/retrieve.py). ir_07 is explicit that the magnitude loss is structural to RRF, not an ir implementation defect — so delegating would reproduce the identical weakness. vd's native-hybrid path only triggers for server backends ir doesn't use; ir's tiny brute-force corpora always hit the same RRF fallback. Zero payoff on the stated reason.
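To make the structural point concrete, here is a minimal, self-contained sketch of rank-based fusion using the 1/(k+rank) formula. It is not the code from vd/search.py or ir/retrieve.py; the function names, the toy scores, and k=60 (a commonly used default) are assumptions for illustration. Two dense result sets with very different score magnitudes but the same ordering fuse to exactly the same values.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: sum 1/(k + rank) over every ranked list (rank starts at 1)."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused


def ranked_ids(scores):
    """Turn {doc_id: score} into a list of ids, best first. Scores are discarded here."""
    return [doc_id for doc_id, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)]


# Hypothetical dense cosine scores: a confident match vs. a query that matches nothing well.
strong_dense = {"a": 0.95, "b": 0.40, "c": 0.10}
weak_dense = {"a": 0.12, "b": 0.11, "c": 0.10}
lexical = {"a": 7.0, "b": 3.0, "c": 1.0}

fused_strong = rrf_fuse([ranked_ids(strong_dense), ranked_ids(lexical)])
fused_weak = rrf_fuse([ranked_ids(weak_dense), ranked_ids(lexical)])

# Same ordering in, same fused scores out: the magnitude signal is gone before fusion starts.
print(fused_strong == fused_weak)
```

Because only ranks enter the formula, an abstention threshold on the fused score cannot tell the confident query from the weak one, whichever library performs the fusion.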

Why defer — "reduce surface" is also weak

vd.Collection is a flat MutableMapping[str, Document] + search. It does not model the concerns that make up ir's retrieval surface:

  • per-artifact surface collapse (best_per_artifact) — ir indexes multiple surfaces per artifact and dedupes to best-per-artifact; vd has no such notion;
  • the calibrated-abstention score contract — SearchHit.score feeds abs_threshold / the conservative selector's min_score floor, and ir_07's calibration depends on the exact per-mode score semantics;
  • the dol 5-view persistence (meta/vectors/ledger/config/calibration);
  • ledger-driven incremental/idempotent build.

Wrapping CorpusStore as a vd.Collection would keep all of this and add an adapter layer — a net surface increase.
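As an illustration of the first concern in the list above, a per-artifact collapse might look roughly like the sketch below. The Hit dataclass, its fields, and this best_per_artifact body are hypothetical stand-ins, not ir's actual types or implementation; the point is only that a flat key-to-document mapping has no slot for the artifact/surface relationship this step relies on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    # Hypothetical fields: one artifact can be indexed under several surfaces.
    artifact_id: str
    surface: str
    score: float


def best_per_artifact(hits):
    """Keep only the highest-scoring surface hit for each artifact, best first."""
    best = {}
    for hit in hits:
        current = best.get(hit.artifact_id)
        if current is None or hit.score > current.score:
            best[hit.artifact_id] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


hits = [
    Hit("tool_x", "description", 0.62),
    Hit("tool_x", "examples", 0.81),
    Hit("tool_y", "description", 0.70),
]
collapsed = best_per_artifact(hits)
print([(h.artifact_id, h.surface, h.score) for h in collapsed])
```

An adapter over vd.Collection would still have to carry this grouping logic somewhere, which is why the wrapper adds surface rather than removing it.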

Why defer — migration cost is real and unrewarded

There is no dol/Files-backed vd backend (vd's persistent backends are faiss/qdrant/lancedb/sqlite_vec). Adopting existing persisted corpora (~/.local/share/ir/corpora/<name>) would require either re-indexing into an external backend (new heavy dep, loses the exact brute-force guarantee) or building a new dol-backed vd Collection adapter — i.e. building vd infra to host ir, the opposite of "reuse wholesale." At ir's scale (≤~250 artifacts) brute-force dense is exact and instant; there is no performance problem to solve.
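For a sense of why brute force is sufficient at this scale, exact dense search is just one cosine per stored vector followed by a sort. The sketch below is pure Python with made-up two-dimensional vectors; the function names are assumptions and this is not ir's retrieval code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def brute_force_search(query_vec, vectors, top_k=3):
    """Score every stored vector against the query and return the top_k (id, score) pairs."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy corpus; real embeddings would have hundreds of dimensions.
vectors = {"a": [1.0, 0.0], "b": [0.6, 0.8], "c": [0.0, 1.0]}
results = brute_force_search([1.0, 0.0], vectors, top_k=2)
print(results)
```

Every vector is compared, so the result is exact by construction; an approximate index in an external backend would trade that guarantee for speed that a corpus of a few hundred artifacts does not need.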

ir already reuses vd where reuse is clean: vd.BM25Index (lexical), vd.reciprocal_rank_fusion (fusion), vd.filters.matches_filter (hard filter). That is the right reuse boundary.

The one genuinely valuable thread (split out)

The sub-goal that does address ir_07's abstention concern — a magnitude-preserving hybrid fusion (normalized dense+lexical score blend instead of pure RRF) — is independent of the storage refactor and is being prototyped in ir.retrieve directly. See #TBD (separate issue/PR).
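One possible shape for such a blend is sketched below. It is not the prototype in ir.retrieve: the saturating transform s/(s+pivot) for lexical scores, the pivot and alpha values, and the 0.5 threshold are all assumptions chosen for illustration. The lexical normalization is deliberately query-independent, since per-query min-max scaling would discard magnitude in the same way ranks do.

```python
def saturate(raw_score, pivot=5.0):
    """Map an unbounded non-negative lexical score into [0, 1) without looking at other hits."""
    if raw_score <= 0:
        return 0.0
    return raw_score / (raw_score + pivot)


def blend(dense, lexical, alpha=0.5, pivot=5.0):
    """Weighted sum of dense cosine (clamped at 0) and saturated lexical score per document."""
    fused = {}
    for doc_id in set(dense) | set(lexical):
        dense_part = max(0.0, dense.get(doc_id, 0.0))
        lexical_part = saturate(lexical.get(doc_id, 0.0), pivot)
        fused[doc_id] = alpha * dense_part + (1 - alpha) * lexical_part
    return fused


lexical = {"a": 7.0, "b": 3.0, "c": 1.0}
strong = blend({"a": 0.95, "b": 0.40, "c": 0.10}, lexical)
weak = blend({"a": 0.12, "b": 0.11, "c": 0.10}, lexical)

threshold = 0.5  # hypothetical abstention floor
print(max(strong.values()) >= threshold, max(weak.values()) >= threshold)
```

With the same rankings as the RRF case, the confident query now scores well above the weak one, so a score floor has something to act on. Any real version would need its threshold recalibrated, because the fused score semantics change.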

Revisit if

ir's corpus scale grows past brute-force comfort, or vd ships a dol-backed Collection that preserves dense cosine magnitude in hybrid fusion.
