Refs #1, #12. Records the assessment of the ir_04 idea "unify ir storage behind vd.Collection to reuse vd.hybrid_search wholesale." Decision: defer (not skip). Verified against the actual vd and ir source.
Why defer — the headline payoff does not exist
The stated motivation was partly to fix the RRF score-magnitude loss ir_07 flagged as weakening hybrid abstention. But vd.hybrid_search's fallback path (vd/search.py _rrf_fuse) uses the same rank-based formula 1/(k+rank) as ir's own _rrf_fuse (ir/retrieve.py). ir_07 is explicit that the magnitude loss is structural to RRF, not an ir implementation defect — so delegating would reproduce the identical weakness. vd's native-hybrid path only triggers for server backends ir doesn't use; ir's tiny brute-force corpora always hit the same RRF fallback. Zero payoff on the stated reason.
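To make the structural point concrete, here is a minimal sketch of rank-based fusion. It is not the code from vd/search.py or ir/retrieve.py; the helper names, the sample scores, and k=60 (a commonly used constant) are assumptions for illustration only. It shows that two queries with very different dense similarities but the same ordering fuse to identical scores.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused


def ranked_ids(scores):
    """Turn a {doc_id: score} mapping into doc ids ordered best-first."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in ordered]


# Hypothetical scores: a confident dense match vs. a near-noise dense match.
strong_dense = {"a": 0.92, "b": 0.35}
weak_dense = {"a": 0.12, "b": 0.11}
lexical = {"a": 7.1, "b": 2.0}

strong = rrf_fuse([ranked_ids(strong_dense), ranked_ids(lexical)])
weak = rrf_fuse([ranked_ids(weak_dense), ranked_ids(lexical)])

# Only ranks enter the formula, so the fused scores cannot tell the cases apart.
print(strong == weak)
```

Because the fused score depends on ranks alone, a threshold on it cannot separate the confident query from the near-noise one, whichever package hosts the fusion.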
Why defer — "reduce surface" is also weak
vd.Collection is a flat MutableMapping[str, Document] + search. It does not model the concerns that make up ir's retrieval surface:
- per-artifact surface collapse (best_per_artifact): ir indexes multiple surfaces per artifact and dedupes to best-per-artifact; vd has no such notion;
- the calibrated-abstention score contract: SearchHit.score feeds abs_threshold and the conservative selector's min_score floor, and ir_07's calibration depends on the exact per-mode score semantics;
- the dol 5-view persistence (meta/vectors/ledger/config/calibration);
- ledger-driven incremental/idempotent build.
Wrapping CorpusStore as a vd.Collection would keep all of this and add an adapter layer — a net surface increase.
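As a rough illustration of the first concern, per-artifact collapse might look like the sketch below. The Hit dataclass, its field names, and this function body are hypothetical stand-ins, not ir's actual SearchHit or best_per_artifact implementation; the point is only that the dedupe step needs an artifact identity that a flat str-to-Document mapping does not carry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit: one scored surface of one artifact."""
    artifact_id: str
    surface: str
    score: float


def best_per_artifact(hits):
    """Keep the highest-scoring surface per artifact, ordered best-first."""
    best = {}
    for hit in hits:
        current = best.get(hit.artifact_id)
        if current is None or hit.score > current.score:
            best[hit.artifact_id] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


hits = [
    Hit("doc1", "title", 0.41),
    Hit("doc1", "body", 0.78),
    Hit("doc2", "body", 0.55),
    Hit("doc2", "summary", 0.30),
]
collapsed = best_per_artifact(hits)
for hit in collapsed:
    print(hit.artifact_id, hit.surface, hit.score)
```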
Why defer — migration cost is real and unrewarded
There is no dol/Files-backed vd backend (vd's persistent backends are faiss/qdrant/lancedb/sqlite_vec). Adopting existing persisted corpora (~/.local/share/ir/corpora/<name>) would require either re-indexing into an external backend (new heavy dep, loses the exact brute-force guarantee) or building a new dol-backed vd Collection adapter — i.e. building vd infra to host ir, the opposite of "reuse wholesale." At ir's scale (≤~250 artifacts) brute-force dense is exact and instant; there is no performance problem to solve.
ir already reuses vd where reuse is clean: vd.BM25Index (lexical), vd.reciprocal_rank_fusion (fusion), vd.filters.matches_filter (hard filter). That is the right reuse boundary.
The one genuinely valuable thread (split out)
The sub-goal that does address ir_07's abstention concern — a magnitude-preserving hybrid fusion (normalized dense+lexical score blend instead of pure RRF) — is independent of the storage refactor and is being prototyped in ir.retrieve directly. See #TBD (separate issue/PR).
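One possible shape for such a blend is sketched below. It is not the prototype in ir.retrieve. The function name, alpha, lexical_scale, and the choice of a saturating raw / (raw + scale) squash for the lexical score are all assumptions. The saturating form is used here because per-query min-max normalization would map every query's top hit to 1.0 and so discard absolute magnitude again.

```python
def blend_scores(dense, lexical, alpha=0.7, lexical_scale=5.0):
    """Blend dense cosine scores with squashed lexical scores.

    dense:   {doc_id: cosine similarity}, used as-is (clamped at 0).
    lexical: {doc_id: raw lexical score}, mapped to [0, 1) by raw / (raw + scale).
    alpha and lexical_scale are illustrative values, not calibrated ones.
    """
    fused = {}
    for doc_id in dense.keys() | lexical.keys():
        d = max(0.0, dense.get(doc_id, 0.0))
        raw = max(0.0, lexical.get(doc_id, 0.0))
        lex = raw / (raw + lexical_scale)
        fused[doc_id] = alpha * d + (1.0 - alpha) * lex
    return fused


# Hypothetical scores: same ordering, very different dense confidence.
strong_dense = {"a": 0.92, "b": 0.35}
weak_dense = {"a": 0.12, "b": 0.11}
lexical = {"a": 7.1, "b": 2.0}

strong = blend_scores(strong_dense, lexical)
weak = blend_scores(weak_dense, lexical)

# Unlike rank-based fusion, the top score now tracks dense magnitude.
print(strong["a"] > weak["a"])
```

With a blend of this kind a score floor can separate the two queries, which pure RRF cannot do. Any real thresholds would still need the kind of per-mode calibration ir_07 describes.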
Revisit if
ir's corpus scale grows past brute-force comfort, or vd ships a dol-backed Collection that preserves dense cosine magnitude in hybrid fusion.