Add graded package-relevance eval harness to ir.eval (#66) #67
Merged
Additive, offline, model-free graded scoreboard for topical-relevance runs: the shared measurement contract ef model bake-offs and the raglab labeling recipe both score against (tracking #61).

- PackageRelevanceCase (frozen, JSONL round-trip) + RELEVANCE_LEVELS / LEVEL_GAINS (none/tangential=0, uses-tools=1, strong=2, core=3); save/load + read_package_meta.
- to_graded_qrels: graded gains reach ef.evaluation nDCG (vs to_qrels' flat 1).
- Named-set diagnostics: fp_rate_on_distractors / recall_on_hard_positives (pure), NamedSets + derive_named_sets, evaluate_named_sets -> auditable NamedSetReport.
- compare_indexings -> ComparisonReport: graded nDCG@k + named-set metrics + per-named-id rank/score deltas; .to_dict() JSON-clean; .regressions() drives a pytest gate. Embedder-agnostic (an ef instruction-tuned or ir deps-as-text corpus is just another entry).
- Public-name-only fixture + tests (16); full suite green (415).

Closes #66.

Claude-Session: https://claude.ai/code/session_01D229oNHVN1drd1mdbQL5MV
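To make the "graded gains vs flat 1" point concrete, here is a minimal standalone sketch. It does not use `ir.eval` or `ef.evaluation`. The `LEVEL_GAINS` mapping mirrors the values stated above, but `dcg_at_k`, `ndcg_at_k`, the package ids, and their labels are hypothetical, and the linear-gain, log2-discount nDCG is an assumption that may differ from what `ef.evaluation` computes.

```python
import math

# Gain mapping as stated in the commit message; everything else is illustrative.
LEVEL_GAINS = {"none": 0, "tangential": 0, "uses-tools": 1, "strong": 2, "core": 3}


def dcg_at_k(gains, k):
    """Discounted cumulative gain with linear gains and a log2 discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranking, qrels, k):
    """nDCG@k for one query; ids missing from qrels count as gain 0."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranking]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Hypothetical labels for one theme.
labels = {"pkg_a": "core", "pkg_b": "uses-tools", "pkg_c": "none"}
graded = {pid: LEVEL_GAINS[level] for pid, level in labels.items()}
# Flat-binary view: every positive gets grade 1, so "core" and "uses-tools" tie.
binary = {pid: 1 for pid, gain in graded.items() if gain > 0}

# The core package is ranked below a weaker positive.
ranking = ["pkg_b", "pkg_a", "pkg_c"]
graded_score = ndcg_at_k(ranking, graded, k=3)
binary_score = ndcg_at_k(ranking, binary, k=3)
print(f"graded nDCG@3 = {graded_score:.3f}, binary nDCG@3 = {binary_score:.3f}")
```

With flat grades the misordering is invisible (both positives look equally good), whereas graded gains penalize ranking the core package second.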
What
Adds the graded package-relevance eval harness to `ir.eval`: the Group-0 unblocker from the 231-package embeddings-vs-graphs run (tracking #61). It is the shared, offline, model-free scoreboard that the `ef` model bake-offs (#34/#35), the `ir` indexing experiments (#62–#65), and the `raglab` labeling recipe (#8) all score against. Single-shot, no model, does not touch ranking → lands in `ir` per #38.

Why
`ir.eval`'s only persistable unit was `DiscoveryCase` (flat-binary gold; `to_qrels` hardcodes grade 1), so graded gains never reached nDCG and there was nowhere to record that `au` is a known graphs distractor scored 0.51, or that `chromadol` is a core embeddings positive scored 0.02. Aggregate nDCG hides exactly the two failure modes the run surfaced.

What's added (all additive to `ir/eval.py`)

- `PackageRelevanceCase` (frozen, JSONL round-trip) + module constants `RELEVANCE_LEVELS` / `LEVEL_GAINS` (`none`/`tangential`=0, `uses-tools`=1, `strong`=2, `core`=3); `save_package_cases` / `load_package_cases` / `read_package_meta`; `to_graded_qrels` feeding `ef.evaluation` graded qrels with zero `ef` change.
- `fp_rate_on_distractors`, `recall_on_hard_positives` (pure); `NamedSets` + deterministic `derive_named_sets`; `evaluate_named_sets` → `NamedSetReport` that reports the rate and the ids actually seen (auditable).
- `compare_indexings(corpora, cases, *, themes, probes, k, named_sets)` → `ComparisonReport` with graded nDCG@k, the two named-set metrics, and per-named-id rank/score deltas. `.to_dict()` is JSON-clean (qh/HTTP); `.regressions(threshold)` returns named packages that got worse → drives a pytest gate. Embedder-agnostic.

Validation
Run end-to-end against the real frozen baseline (`all-MiniLM-L6-v2` / hybrid, stored privately): it reproduces the failure numerically (graphs `hard_positive_recall@20` = 0.00, embeddings 0.40, distractor `fp_rate@20` 0.22/0.14) while graded nDCG@20 stays high (0.74/0.80), demonstrating that aggregate nDCG hides the two failure modes.

Privacy
The full 231-row labeling + frozen baseline live in a private repo (`thorwhalen/ir-eval-data`). Only a tiny public-name-only fixture (`networkx`, `graphviz`, `chromadb`, `sentence-transformers`, …) is committed under `tests/fixtures/`, exercising the schema, `to_graded_qrels`, round-trip, and the metric/gate functions offline in CI.

Tests
16 new tests in `tests/test_package_relevance.py` (hermetic, light embedder); full suite green (415 passed). Ruff E/W/F/B clean at line-length 88.

Closes #66. Part of #61.
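For readers who want a feel for the two named-set diagnostics, the following is a rough sketch rather than the code in `ir/eval.py`. It assumes both metrics are "share of a named set that shows up in the top-k", which the description does not spell out. The helper `rate_in_top_k` and every package id other than `networkx` and `graphviz` are hypothetical.

```python
def rate_in_top_k(ranking, named_ids, k):
    """Return (rate, seen): the share of named_ids found in the top-k of
    ranking, plus the sorted ids actually seen, so the number is auditable."""
    top_k = set(ranking[:k])
    seen = sorted(pid for pid in named_ids if pid in top_k)
    rate = len(seen) / len(named_ids) if named_ids else 0.0
    return rate, seen


# Hypothetical ranked result list for a "graphs" theme, best first.
ranking = ["networkx", "distractor_x", "graphviz", "other_pkg", "hard_pos_y"]
distractors = {"distractor_x", "distractor_z"}    # should stay OUT of the top-k
hard_positives = {"graphviz", "hard_pos_y"}       # should land IN the top-k

# Assumed reading of fp_rate@k: fraction of named distractors in the top-k.
fp_rate, fp_seen = rate_in_top_k(ranking, distractors, k=3)
# Assumed reading of hard-positive recall@k: fraction of hard positives in the top-k.
recall, hp_seen = rate_in_top_k(ranking, hard_positives, k=3)

print(f"fp_rate@3 = {fp_rate:.2f}, seen = {fp_seen}")
print(f"hard_positive_recall@3 = {recall:.2f}, seen = {hp_seen}")
```

Returning the ids alongside the rate is the part that makes a report reviewable: a reader can check which distractor leaked in instead of trusting a bare number.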
https://claude.ai/code/session_01D229oNHVN1drd1mdbQL5MV
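The regression gate mentioned in the description could look roughly like the sketch below. It is an illustration only: `rank_regressions`, its signature, and the package ids and ranks are hypothetical, and it assumes "got worse" means a hard positive's rank grew by more than a threshold. The real `.regressions(threshold)` may define this differently, for example on scores, or by treating distractors that move up as regressions too.

```python
def rank_regressions(baseline, candidate, threshold=0):
    """Map id -> rank delta for ids whose rank got worse (numerically larger)
    by more than threshold. Ids missing from candidate are skipped."""
    return {
        pid: candidate[pid] - baseline[pid]
        for pid in baseline
        if pid in candidate and candidate[pid] - baseline[pid] > threshold
    }


# Hypothetical 1-based ranks of named hard positives under two indexings.
BASELINE_RANKS = {"hard_pos_a": 12, "hard_pos_b": 40}
CANDIDATE_RANKS = {"hard_pos_a": 9, "hard_pos_b": 41}


def test_no_hard_positive_regressions():
    # pytest-style gate: fail if any named package dropped more than 5 places.
    worse = rank_regressions(BASELINE_RANKS, CANDIDATE_RANKS, threshold=5)
    assert worse == {}, f"named packages regressed: {worse}"
```

A tolerance above zero keeps the gate from failing on small reshuffles, while the returned mapping names the packages that moved.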