You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Add an evaluation harness that measures retrieval quality (recall@k / MRR / nDCG) of search / retrieve against a labeled query set, and that compares quality across dataset versions or configs (embedding model, distance metric, index, hybrid fusion). Retrieval is lance-context's core competency, so retrieval-quality eval belongs next to it — and the versioned store makes cross-version regression comparison possible in a way a stateless vector DB can't.
Motivation
Users already tune knobs that affect retrieval — distance metric (#74 / #77), embedding dimension (#78), hybrid fusion weights (#72 / #76) — but there is no supported way to measure whether a change actually helps. Today you would hand-roll search calls into a spreadsheet. There is no metric implementation, no labeled-set runner, and no way to A/B two versions or configs.
Proposed work
Eval dataset format — labeled query → relevant record ids (with optional graded relevance), loadable from JSONL / Arrow. Reference records by stable external_id.
Metrics — recall@k, precision@k, MRR, nDCG@k, hit-rate — computed over both search (vector) and retrieve (hybrid).
Runner — run a query set against a context at a pinned version, with configurable k, filters, and lifecycle options; return per-query and aggregate scores.
Version / config A-B — run the same query set against two versions (via checkout) or two configs and report per-metric deltas, for regression detection across embedding-model / index / metric / fusion changes.
Report — aggregate table + per-query breakdown + a manifest (context version, config, query-set id) for reproducibility.
Examples + docs — evaluate a context; A/B a new embedding model against the prior version.
Why here
Retrieval is the core API (search_filtered_with_options, store.rs:972; retrieve_filtered_with_options, store.rs:1003) — eval belongs alongside it.
Versioning + checkout (store.rs:785) enables cross-version regression eval ("does version N+1 retrieve better than N on the same query set?"), which is the differentiator over a stateless vector DB.
Non-goals
Does not evaluate a trained model or downstream task quality — that is a trainer / eval-harness concern. Scope is retrieval and data quality only.
A user can run a labeled query set against a context and get recall@k / MRR / nDCG for both vector and hybrid retrieval.
A user can A/B two versions (or two configs) on the same query set and get per-metric deltas.
Results are pinned to a context version + config in a reproducible report.
Tests cover metric correctness against known fixtures, filter/lifecycle handling, and version A-B.
Notes
Builds on existing retrieval, versioning, configurable distance metric (#74 / #77), hybrid retrieval (#72 / #76), configurable embedding dimension (#78), and the embedding provider registry (#87). The labeled eval/holdout set overlaps the decontamination holdout in #96 and the deferred train/eval split — same artifact, different use — so the eval-set format should be shared with the post-training pipeline.
Summary
Add an evaluation harness that measures retrieval quality (recall@k / MRR / nDCG) of
search/retrieveagainst a labeled query set, and that compares quality across dataset versions or configs (embedding model, distance metric, index, hybrid fusion). Retrieval is lance-context's core competency, so retrieval-quality eval belongs next to it — and the versioned store makes cross-version regression comparison possible in a way a stateless vector DB can't.Motivation
Users already tune knobs that affect retrieval — distance metric (#74 / #77), embedding dimension (#78), hybrid fusion weights (#72 / #76) — but there is no supported way to measure whether a change actually helps. Today you would hand-roll
searchcalls into a spreadsheet. There is no metric implementation, no labeled-set runner, and no way to A/B two versions or configs.Proposed work
query → relevant record ids(with optional graded relevance), loadable from JSONL / Arrow. Reference records by stableexternal_id.search(vector) andretrieve(hybrid).version, with configurablek, filters, and lifecycle options; return per-query and aggregate scores.checkout) or two configs and report per-metric deltas, for regression detection across embedding-model / index / metric / fusion changes.Why here
search_filtered_with_options,store.rs:972;retrieve_filtered_with_options,store.rs:1003) — eval belongs alongside it.checkout(store.rs:785) enables cross-version regression eval ("does version N+1 retrieve better than N on the same query set?"), which is the differentiator over a stateless vector DB.Non-goals
Acceptance criteria
Notes
Builds on existing retrieval, versioning, configurable distance metric (#74 / #77), hybrid retrieval (#72 / #76), configurable embedding dimension (#78), and the embedding provider registry (#87). The labeled eval/holdout set overlaps the decontamination holdout in #96 and the deferred train/eval split — same artifact, different use — so the eval-set format should be shared with the post-training pipeline.