Add graded package-relevance eval harness to ir.eval (#66) by thorwhalen · Pull Request #67 · i2mint/ir

thorwhalen · 2026-06-18T13:16:18Z

What

Adds the graded package-relevance eval harness to ir.eval: the Group-0 unblocker from the 231-package embeddings-vs-graphs run (tracking #61). It is the shared, offline, model-free scoreboard that the ef model bake-offs (#34/#35), the ir indexing experiments (#62–#65), and the raglab labeling recipe (#8) all score against. It is single-shot, uses no model, and does not touch ranking, so it lands in ir per #38.

Why

ir.eval's only persistable unit was DiscoveryCase (flat-binary gold; to_qrels hardcodes grade 1), so graded gains never reached nDCG and there was nowhere to record that au is a known graphs distractor scored 0.51, or that chromadol is a core embeddings positive scored 0.02. Aggregate nDCG hides exactly the two failure modes the run surfaced.
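To make the first point concrete, here is a minimal sketch of why flat grade-1 qrels cannot distinguish a ranking that puts a core package first from one that puts a weaker package first. It is not the ef.evaluation implementation: the `dcg` and `ndcg_at_k` helpers, the linear-gain formula, and the package ids are all assumptions chosen for illustration.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids, qrels, k):
    """nDCG@k for one query; ids missing from qrels count as gain 0."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical labels for one theme: gain 3 = core, gain 1 = uses-tools.
graded = {"pkg_core": 3, "pkg_uses_tools": 1}
# What a hardcoded grade of 1 produces from the same labels.
binary = {doc_id: 1 for doc_id in graded}

good = ["pkg_core", "pkg_uses_tools"]  # core package ranked first
bad = ["pkg_uses_tools", "pkg_core"]   # weaker package ranked first

binary_good = ndcg_at_k(good, binary, k=2)
binary_bad = ndcg_at_k(bad, binary, k=2)
graded_good = ndcg_at_k(good, graded, k=2)
graded_bad = ndcg_at_k(bad, graded, k=2)

print(binary_good, binary_bad)  # identical under flat gains
print(graded_good, graded_bad)  # the worse ordering scores lower under graded gains
```

Under flat gains both orderings are indistinguishable; only the graded qrels penalize ranking the weaker package above the core one.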

What's added (all additive to `ir/eval.py`)

- Schema: PackageRelevanceCase (frozen, JSONL round-trip) + module constants RELEVANCE_LEVELS / LEVEL_GAINS (none/tangential=0, uses-tools=1, strong=2, core=3); save_package_cases / load_package_cases / read_package_meta; to_graded_qrels feeding ef.evaluation graded qrels with zero ef change.
- Named-set metrics: fp_rate_on_distractors, recall_on_hard_positives (pure); NamedSets + deterministic derive_named_sets; evaluate_named_sets → NamedSetReport that reports the rate and the ids actually seen (auditable).
- A/B gate: compare_indexings(corpora, cases, *, themes, probes, k, named_sets) → ComparisonReport with graded nDCG@k, the two named-set metrics, and per-named-id rank/score deltas. .to_dict() is JSON-clean (qh/HTTP); .regressions(threshold) returns named packages that got worse → drives a pytest gate. Embedder-agnostic.
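The schema and named-set metrics above can be sketched roughly as follows. Only the LEVEL_GAINS values come from this description; the `Case` dataclass, its field names, the function signatures, the choice to drop zero-gain rows, and the metric definitions (fraction of the named set found in the top k) are assumptions, and the real ir.eval code may differ. The labels are toy values.

```python
from dataclasses import dataclass

# Gains per relevance level, as listed in the PR description.
LEVEL_GAINS = {"none": 0, "tangential": 0, "uses-tools": 1, "strong": 2, "core": 3}


@dataclass(frozen=True)
class Case:
    """Hypothetical stand-in for PackageRelevanceCase (field names assumed)."""

    theme: str
    package: str
    level: str


def graded_qrels(cases):
    """Group cases into {theme: {package: gain}}, dropping zero-gain rows."""
    qrels = {}
    for case in cases:
        gain = LEVEL_GAINS[case.level]
        if gain > 0:
            qrels.setdefault(case.theme, {})[case.package] = gain
    return qrels


def fp_rate(ranked, distractors, k):
    """Share of known distractors that reach the top k, plus the ids seen."""
    seen = [p for p in ranked[:k] if p in distractors]
    rate = len(seen) / len(distractors) if distractors else 0.0
    return rate, seen


def hard_positive_recall(ranked, hard_positives, k):
    """Share of known hard positives that reach the top k, plus the ids seen."""
    seen = [p for p in ranked[:k] if p in hard_positives]
    rate = len(seen) / len(hard_positives) if hard_positives else 0.0
    return rate, seen


# Toy labels for illustration only.
cases = [
    Case("graphs", "networkx", "core"),
    Case("graphs", "au", "none"),
    Case("embeddings", "chromadol", "core"),
]
qrels = graded_qrels(cases)

fp = fp_rate(["au", "networkx", "graphviz"], {"au"}, k=2)
recall = hard_positive_recall(["sentence-transformers", "chromadb"], {"chromadol"}, k=2)
```

Returning the ids alongside the rate mirrors the "auditable" property described above: a reviewer can see which named packages produced the number.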

Validation

Run end-to-end against the real frozen baseline (all-MiniLM-L6-v2 / hybrid, stored privately): it reproduces the failure numerically — graphs hard_positive_recall@20 = 0.00, embeddings 0.40, distractor fp_rate@20 0.22/0.14 — while graded nDCG@20 stays high (0.74/0.80), demonstrating that aggregate nDCG hides the two failure modes.
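The effect can be reproduced on toy data. The sketch below uses invented package ids and gains (not the private baseline) and assumed helper definitions to show how nDCG@k can stay high while recall on a named hard positive is zero.

```python
import math


def ndcg_at_k(ranked, gains, k):
    """nDCG@k with linear gains and a log2 discount (an assumed formulation)."""

    def dcg(values):
        return sum(v / math.log2(i + 1) for i, v in enumerate(values, start=1))

    got = dcg([gains.get(p, 0) for p in ranked[:k]])
    best = dcg(sorted(gains.values(), reverse=True)[:k])
    return got / best if best else 0.0


def recall_at_k(ranked, targets, k):
    """Fraction of target ids that appear in the top k."""
    top = set(ranked[:k])
    return sum(1 for t in targets if t in top) / len(targets)


# Invented labels: four easy positives and one hard positive the ranker misses.
gains = {"a": 3, "b": 3, "c": 2, "d": 2, "hard": 3}
ranked = ["a", "b", "c", "d", "x", "y", "hard"]
hard_positives = {"hard"}

score = ndcg_at_k(ranked, gains, k=5)
recall = recall_at_k(ranked, hard_positives, k=5)

print(round(score, 2))  # about 0.84: the aggregate still looks healthy
print(recall)           # 0.0: the named hard positive never reaches the top 5
```

The easy positives carry most of the discounted gain, so one missing hard positive barely moves the aggregate, which is why the named-set metrics are reported separately.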

Privacy

The full 231-row labeling + frozen baseline live in a private repo (thorwhalen/ir-eval-data). Only a tiny public-name-only fixture (networkx, graphviz, chromadb, sentence-transformers, …) is committed under tests/fixtures/, exercising the schema, to_graded_qrels, round-trip, and the metric/gate functions offline in CI.
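A round-trip check of the kind the fixture exercises might look like the sketch below. `FixtureCase`, its fields, the labels, and the two helper functions are hypothetical stand-ins, not the save_package_cases / load_package_cases API; an in-memory buffer replaces the fixture file so the example stays offline.

```python
import io
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FixtureCase:
    """Hypothetical case record; the real PackageRelevanceCase fields may differ."""

    theme: str
    package: str
    level: str


def dump_jsonl(cases, stream):
    """Write one JSON object per line."""
    for case in cases:
        stream.write(json.dumps(asdict(case), sort_keys=True) + "\n")


def load_jsonl(stream):
    """Read cases back, skipping blank lines."""
    return [FixtureCase(**json.loads(line)) for line in stream if line.strip()]


# Toy labels on public package names, for illustration only.
cases = [
    FixtureCase("graphs", "networkx", "core"),
    FixtureCase("embeddings", "chromadb", "strong"),
]

buffer = io.StringIO()
dump_jsonl(cases, buffer)
buffer.seek(0)
restored = load_jsonl(buffer)
```

Because the dataclass is frozen and compares by value, `restored == cases` is enough to assert a lossless round trip.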

Tests

16 new tests in tests/test_package_relevance.py (hermetic, light embedder); full suite green (415 passed). Ruff E/W/F/B clean at line-length 88.
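A pytest gate built on the idea behind .regressions(threshold) could look roughly like this. The sketch compares plain ranked lists rather than a ComparisonReport, and `rank_of`, `rank_regressions`, and the rank-based threshold are assumptions. It also covers only hard positives, for which a larger rank number is worse.

```python
def rank_of(ranked, package):
    """1-based rank of package, or None when it is absent from the ranking."""
    return ranked.index(package) + 1 if package in ranked else None


def rank_regressions(baseline, candidate, hard_positives, threshold=0):
    """Hard positives whose rank worsened by more than `threshold` positions."""
    worse = {}
    for package in sorted(hard_positives):
        before = rank_of(baseline, package)
        after = rank_of(candidate, package)
        # Treat a missing package as ranked just past the end of the list.
        before = before if before is not None else len(baseline) + 1
        after = after if after is not None else len(candidate) + 1
        if after - before > threshold:
            worse[package] = (before, after)
    return worse


def test_no_hard_positive_regressions():
    """Fails when a named hard positive drops in the candidate indexing."""
    baseline = ["networkx", "graphviz", "chromadb"]
    candidate = ["graphviz", "networkx", "chromadb"]
    assert rank_regressions(baseline, candidate, {"chromadb"}) == {}


# Toy rankings: networkx slips from rank 1 to rank 2 in the candidate.
demo = rank_regressions(
    ["networkx", "graphviz", "chromadb"],
    ["graphviz", "networkx", "chromadb"],
    {"networkx"},
)
```

Returning the before/after ranks rather than a bare boolean keeps a failing gate readable: the assertion message names the package and how far it moved.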

Closes #66. Part of #61.

https://claude.ai/code/session_01D229oNHVN1drd1mdbQL5MV

Additive, offline, model-free graded scoreboard for topical-relevance runs: the shared measurement contract ef model bake-offs and the raglab labeling recipe both score against (tracking #61).

- PackageRelevanceCase (frozen, JSONL round-trip) + RELEVANCE_LEVELS / LEVEL_GAINS (none/tangential=0, uses-tools=1, strong=2, core=3); save/load + read_package_meta.
- to_graded_qrels: graded gains reach ef.evaluation nDCG (vs to_qrels' flat 1).
- Named-set diagnostics: fp_rate_on_distractors / recall_on_hard_positives (pure), NamedSets + derive_named_sets, evaluate_named_sets -> auditable NamedSetReport.
- compare_indexings -> ComparisonReport: graded nDCG@k + named-set metrics + per-named-id rank/score deltas; .to_dict() JSON-clean; .regressions() drives a pytest gate. Embedder-agnostic (an ef instruction-tuned or ir deps-as-text corpus is just another entry).
- Public-name-only fixture + tests (16); full suite green (415).

Closes #66.

Claude-Session: https://claude.ai/code/session_01D229oNHVN1drd1mdbQL5MV

thorwhalen merged commit 5401a2c into master Jun 18, 2026
12 checks passed

thorwhalen deleted the claude/issue-66-graded-eval-harness branch June 18, 2026 13:18

thorwhalen mentioned this pull request Jun 18, 2026

Tracking: 231-package relevance run — turn the FP/FN findings into ranking + eval improvements across ir / ef / raglab #61

Open

8 tasks

thorwhalen merged 1 commit into master from claude/issue-66-graded-eval-harness
