Retrieval-quality + version A/B evaluation harness (recall@k / MRR / nDCG) #98

Description

@dcfocus

Summary

Add an evaluation harness that measures retrieval quality (recall@k / MRR / nDCG) of search / retrieve against a labeled query set, and that compares quality across dataset versions or configs (embedding model, distance metric, index, hybrid fusion). Retrieval is lance-context's core competency, so retrieval-quality eval belongs next to it — and the versioned store makes cross-version regression comparison possible in a way a stateless vector DB can't.

Motivation

Users already tune knobs that affect retrieval (distance metric in #74 / #77, embedding dimension in #78, hybrid fusion weights in #72 / #76), but there is no supported way to measure whether a change actually helps. Today the only option is to run search calls by hand and tabulate the results in a spreadsheet. There is no metric implementation, no labeled-set runner, and no way to A/B two versions or configs.

Proposed work

  1. Eval dataset format — labeled query → relevant record ids (with optional graded relevance), loadable from JSONL / Arrow. Reference records by stable external_id.
  2. Metrics — recall@k, precision@k, MRR, nDCG@k, hit-rate — computed over both search (vector) and retrieve (hybrid).
  3. Runner — run a query set against a context at a pinned version, with configurable k, filters, and lifecycle options; return per-query and aggregate scores.
  4. Version / config A-B — run the same query set against two versions (via checkout) or two configs and report per-metric deltas, for regression detection across embedding-model / index / metric / fusion changes.
  5. Report — aggregate table + per-query breakdown + a manifest (context version, config, query-set id) for reproducibility.
  6. Examples + docs — evaluate a context; A/B a new embedding model against the prior version.
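
A minimal sketch of the metrics in item 2, written in Python only for brevity. It assumes binary relevance for recall, precision, hit-rate, and MRR, and graded relevance with linear gain and a log2 discount for nDCG (other gain conventions, such as 2^grade - 1, exist). The function names are illustrative and are not part of lance-context. Result ids are assumed to be unique within one ranking.

```python
import math
from typing import Dict, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for rid in ranked[:k] if rid in relevant)
    return hits / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the k result slots that hold a relevant id."""
    if k <= 0:
        return 0.0
    hits = sum(1 for rid in ranked[:k] if rid in relevant)
    return hits / k


def hit_rate_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant id is in the top k, else 0.0."""
    return 1.0 if any(rid in relevant for rid in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result; 0.0 if none is retrieved.
    MRR is the mean of this value over all queries."""
    for rank, rid in enumerate(ranked, start=1):
        if rid in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """nDCG@k with linear gain: DCG of the ranking divided by the DCG of
    the ideal ordering of the labeled gains."""
    dcg = sum(
        gains.get(rid, 0.0) / math.log2(i + 2)
        for i, rid in enumerate(ranked[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Aggregate scores would then be the mean of each per-query value over the query set. Queries with no labeled relevant ids score 0.0 here; a real runner might prefer to reject or skip them.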

Why here

  • Retrieval is the core API (search_filtered_with_options, store.rs:972; retrieve_filtered_with_options, store.rs:1003) — eval belongs alongside it.
  • Versioning + checkout (store.rs:785) enables cross-version regression eval ("does version N+1 retrieve better than N on the same query set?"), which is the differentiator over a stateless vector DB.
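
The cross-version comparison can be sketched as follows. This is an illustration under stated assumptions, not the proposed API: `Retriever` is a hypothetical callable standing in for "a context checked out at some version, queried with k", and the lookup tables below are made-up stand-ins for real search results. Only recall@k and MRR are computed, to keep the sketch short.

```python
from typing import Callable, Dict, List, Set

# Hypothetical stand-in for a context pinned to one version or config:
# takes (query, k) and returns ranked record ids.
Retriever = Callable[[str, int], List[str]]


def evaluate(retrieve: Retriever, labeled: Dict[str, Set[str]], k: int) -> Dict[str, float]:
    """Mean recall@k and MRR@k of one retriever over a labeled query set."""
    recalls, rrs = [], []
    for query, relevant in labeled.items():
        ranked = retrieve(query, k)[:k]
        hits = sum(1 for rid in ranked if rid in relevant)
        recalls.append(hits / len(relevant) if relevant else 0.0)
        rr = 0.0
        for rank, rid in enumerate(ranked, start=1):
            if rid in relevant:
                rr = 1.0 / rank
                break
        rrs.append(rr)
    n = len(labeled)
    return {
        f"recall@{k}": sum(recalls) / n if n else 0.0,
        f"mrr@{k}": sum(rrs) / n if n else 0.0,
    }


def ab_deltas(baseline: Retriever, candidate: Retriever,
              labeled: Dict[str, Set[str]], k: int) -> Dict[str, Dict[str, float]]:
    """Score both sides on the same query set and report candidate - baseline."""
    a = evaluate(baseline, labeled, k)
    b = evaluate(candidate, labeled, k)
    return {
        name: {"baseline": a[name], "candidate": b[name], "delta": b[name] - a[name]}
        for name in a
    }


def from_table(table: Dict[str, List[str]]) -> Retriever:
    """Build a fake retriever from precomputed rankings (for illustration)."""
    return lambda query, k: table.get(query, [])[:k]


# Made-up rankings for "version N" and "version N+1" on the same two queries.
labeled = {"q1": {"a"}, "q2": {"b", "c"}}
version_n = from_table({"q1": ["x", "a", "y"], "q2": ["b", "x", "y"]})
version_n1 = from_table({"q1": ["a", "x", "y"], "q2": ["b", "c", "x"]})
report = ab_deltas(version_n, version_n1, labeled, k=3)
```

A positive delta means the candidate scored higher than the baseline on that metric. A regression check could fail a run when any delta drops below a chosen negative threshold.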

Non-goals

  • Does not evaluate a trained model or downstream task quality. That belongs to a trainer or model-evaluation harness; scope here is retrieval and data quality only.
  • Bundles no embedding models; uses the existing provider registry (feat: pluggable embedding provider registry #87) or precomputed embeddings.

Acceptance criteria

  • A user can run a labeled query set against a context and get recall@k / MRR / nDCG for both vector and hybrid retrieval.
  • A user can A/B two versions (or two configs) on the same query set and get per-metric deltas.
  • Results are pinned to a context version + config in a reproducible report.
  • Tests cover metric correctness against known fixtures, filter/lifecycle handling, and version A-B.
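
As one possible "known fixture" for the metric-correctness tests, here is a hand-worked case. It assumes nDCG with linear gain and a log2 discount; the ids and grades are made up for illustration. The ranking is (a, b, c, d), and the labeled gains are b = 2, d = 1, e = 1, with every other id scoring 0.

```latex
\begin{align*}
\mathrm{DCG@}k &= \sum_{i=1}^{k} \frac{g_i}{\log_2(i+1)}, \qquad
\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k} \\[4pt]
\mathrm{DCG@4} &= \frac{0}{\log_2 2} + \frac{2}{\log_2 3} + \frac{0}{\log_2 4} + \frac{1}{\log_2 5}
  \approx 1.2619 + 0.4307 = 1.6925 \\
\mathrm{IDCG@4} &= \frac{2}{\log_2 2} + \frac{1}{\log_2 3} + \frac{1}{\log_2 4}
  \approx 2 + 0.6309 + 0.5 = 3.1309 \\
\mathrm{nDCG@4} &\approx \frac{1.6925}{3.1309} \approx 0.5406
\end{align*}
```

On the same fixture, with binary relevance {b, d, e}: recall@4 = 2/3, precision@4 = 2/4, and the reciprocal rank is 1/2 because the first relevant id (b) sits at rank 2.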

Notes

Builds on existing retrieval, versioning, configurable distance metric (#74 / #77), hybrid retrieval (#72 / #76), configurable embedding dimension (#78), and the embedding provider registry (#87). The labeled eval/holdout set overlaps the decontamination holdout in #96 and the deferred train/eval split — same artifact, different use — so the eval-set format should be shared with the post-training pipeline.
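
One way the shared eval-set format could look as JSONL, sketched with a small loader. The field names (`query_id`, `query`, `relevant`, `external_id`, `grade`) are assumptions made for this example, not a settled schema. A relevant entry may be a bare external_id (binary label, treated as grade 1.0) or an object carrying a graded relevance.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class LabeledQuery:
    query_id: str
    text: str
    # external_id -> graded relevance (1.0 when the label is binary)
    relevance: Dict[str, float] = field(default_factory=dict)


def parse_eval_jsonl(lines: Iterable[str]) -> List[LabeledQuery]:
    """Parse one labeled query per non-empty line.

    Hypothetical row shape:
      {"query_id": "q1", "query": "...",
       "relevant": ["doc-1", {"external_id": "doc-2", "grade": 3}]}
    """
    queries: List[LabeledQuery] = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        relevance: Dict[str, float] = {}
        for item in row["relevant"]:
            if isinstance(item, str):
                relevance[item] = 1.0
            else:
                relevance[item["external_id"]] = float(item.get("grade", 1.0))
        if not relevance:
            raise ValueError(
                f"line {line_no}: query {row['query_id']!r} has no relevant ids"
            )
        queries.append(LabeledQuery(row["query_id"], row["query"], relevance))
    return queries
```

Because the function takes any iterable of strings, it works on an open file handle as well as an in-memory list. Keying labels by external_id rather than an internal row id is what lets the same file be reused across versions and by the holdout tooling.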

Metadata

Labels: enhancement (New feature or request)
