Retrieval-quality + version A/B evaluation harness (recall@k / MRR / nDCG)

## Summary

Add an evaluation harness that measures **retrieval quality** (recall@k / MRR / nDCG) of `search` / `retrieve` against a labeled query set, and that **compares quality across dataset versions or configs** (embedding model, distance metric, index, hybrid fusion). Retrieval is lance-context's core competency, so retrieval-quality eval belongs next to it — and the versioned store makes cross-version regression comparison possible in a way a stateless vector DB can't.

## Motivation

Users already tune knobs that affect retrieval — distance metric (#74 / #77), embedding dimension (#78), hybrid fusion weights (#72 / #76) — but there is no supported way to measure whether a change actually helps. Today you would hand-roll `search` calls into a spreadsheet. There is no metric implementation, no labeled-set runner, and no way to A/B two versions or configs.

## Proposed work

1. **Eval dataset format** — labeled `query → relevant record ids` (with optional graded relevance), loadable from JSONL / Arrow. Reference records by stable `external_id`.
2. **Metrics** — recall@k, precision@k, MRR, nDCG@k, hit-rate — computed over both `search` (vector) and `retrieve` (hybrid).
3. **Runner** — run a query set against a context at a pinned `version`, with configurable `k`, filters, and lifecycle options; return per-query and aggregate scores.
4. **Version / config A-B** — run the same query set against two versions (via `checkout`) or two configs and report per-metric deltas, for regression detection across embedding-model / index / metric / fusion changes.
5. **Report** — aggregate table + per-query breakdown + a manifest (context version, config, query-set id) for reproducibility.
6. **Examples + docs** — evaluate a context; A/B a new embedding model against the prior version.

## Why here

- Retrieval is the core API (`search_filtered_with_options`, `store.rs:972`; `retrieve_filtered_with_options`, `store.rs:1003`) — eval belongs alongside it.
- Versioning + `checkout` (`store.rs:785`) enables **cross-version regression eval** ("does version N+1 retrieve better than N on the same query set?"), which is the differentiator over a stateless vector DB.

## Non-goals

- Does **not** evaluate a trained model or downstream task quality — that is a trainer / eval-harness concern. Scope is retrieval and data quality only.
- Bundles no embedding models; uses the existing provider registry (#87) or precomputed embeddings.

## Acceptance criteria

- A user can run a labeled query set against a context and get recall@k / MRR / nDCG for both vector and hybrid retrieval.
- A user can A/B two versions (or two configs) on the same query set and get per-metric deltas.
- Results are pinned to a context version + config in a reproducible report.
- Tests cover metric correctness against known fixtures, filter/lifecycle handling, and version A-B.

## Notes

Builds on existing retrieval, versioning, configurable distance metric (#74 / #77), hybrid retrieval (#72 / #76), configurable embedding dimension (#78), and the embedding provider registry (#87). The labeled eval/holdout set overlaps the decontamination holdout in #96 and the deferred train/eval split — same artifact, different use — so the eval-set format should be shared with the post-training pipeline.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Retrieval-quality + version A/B evaluation harness (recall@k / MRR / nDCG) #98

Summary

Motivation

Proposed work

Why here

Non-goals

Acceptance criteria

Notes

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Retrieval-quality + version A/B evaluation harness (recall@k / MRR / nDCG) #98

Description

Summary

Motivation

Proposed work

Why here

Non-goals

Acceptance criteria

Notes

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions