Add graded package-relevance eval harness to ir.eval (#66) #67

Merged
thorwhalen merged 1 commit into master from claude/issue-66-graded-eval-harness on Jun 18, 2026

@thorwhalen

Member

What

Adds the graded package-relevance eval harness to ir.eval: the Group-0 unblocker from the 231-package embeddings-vs-graphs run (tracking #61). It is the shared, offline, model-free scoreboard that the ef model bake-offs (#34/#35), the ir indexing experiments (#62#65), and the raglab labeling recipe (#8) all score against. It is single-shot, uses no model, and does not touch ranking → lands in ir per #38.

Why

ir.eval's only persistable unit was DiscoveryCase (flat-binary gold; to_qrels hardcodes grade 1), so graded gains never reached nDCG and there was nowhere to record that au is a known graphs distractor scored 0.51, or that chromadol is a core embeddings positive scored 0.02. Aggregate nDCG hides exactly the two failure modes the run surfaced.
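To make the flat-grade problem concrete, here is a minimal standalone sketch (not the ef.evaluation implementation; the helper names `dcg` and `ndcg_at_k` and the `pkg_*` ids are hypothetical, and linear gains are assumed). With every gold item graded 1, a ranking that buries the core package scores the same as one that leads with it; with graded gains the two rankings separate.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranking, qrels, k):
    """nDCG@k of `ranking` against `qrels` ({doc_id: gain}); unlabeled docs gain 0."""
    gains = [qrels.get(doc, 0) for doc in ranking[:k]]
    ideal_dcg = dcg(sorted(qrels.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg else 0.0


# Hypothetical labels for one theme: graded gains vs. the flat grade 1.
graded = {"pkg_core": 3, "pkg_strong": 2, "pkg_uses_tools": 1}
binary = {doc: 1 for doc in graded}

core_first = ["pkg_core", "pkg_strong", "pkg_uses_tools"]
core_last = ["pkg_uses_tools", "pkg_strong", "pkg_core"]

# Binary gold cannot tell the two rankings apart; graded gold can.
binary_scores = (ndcg_at_k(core_first, binary, 3), ndcg_at_k(core_last, binary, 3))
graded_scores = (ndcg_at_k(core_first, graded, 3), ndcg_at_k(core_last, graded, 3))
print("binary:", binary_scores)
print("graded:", graded_scores)
```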

What's added (all additive to ir/eval.py)

  • Schema: PackageRelevanceCase (frozen, JSONL round-trip) + module constants RELEVANCE_LEVELS / LEVEL_GAINS (none/tangential=0, uses-tools=1, strong=2, core=3); save_package_cases / load_package_cases / read_package_meta; to_graded_qrels feeding ef.evaluation graded qrels with zero ef change.
  • Named-set metrics: fp_rate_on_distractors, recall_on_hard_positives (pure); NamedSets + deterministic derive_named_sets; evaluate_named_sets → NamedSetReport that reports the rate and the ids actually seen (auditable).
  • A/B gate: compare_indexings(corpora, cases, *, themes, probes, k, named_sets) → ComparisonReport with graded nDCG@k, the two named-set metrics, and per-named-id rank/score deltas. .to_dict() is JSON-clean (qh/HTTP); .regressions(threshold) returns named packages that got worse → drives a pytest gate. Embedder-agnostic.
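A rough sketch of how the schema and the graded-qrels conversion could fit together. The level-to-gain mapping is the one listed above; everything else is an assumption for illustration: the field names (`theme`, `package`, `level`), the `*_sketch` names, the `{query: {doc: gain}}` qrels shape, the choice to drop zero-gain rows, and the example labels themselves (they are not taken from the real labeling).

```python
from dataclasses import dataclass

# Level -> gain mapping as stated in the PR description.
LEVEL_GAINS = {"none": 0, "tangential": 0, "uses-tools": 1, "strong": 2, "core": 3}
RELEVANCE_LEVELS = tuple(LEVEL_GAINS)


@dataclass(frozen=True)
class PackageRelevanceCaseSketch:
    """One graded label: how relevant `package` is to `theme` (hypothetical fields)."""

    theme: str
    package: str
    level: str

    def __post_init__(self):
        if self.level not in LEVEL_GAINS:
            raise ValueError(f"unknown relevance level: {self.level!r}")


def to_graded_qrels_sketch(cases):
    """Group cases into {theme: {package: gain}}, keeping only positive gains."""
    qrels = {}
    for case in cases:
        gain = LEVEL_GAINS[case.level]
        if gain > 0:  # assumption: zero-gain rows are omitted from qrels
            qrels.setdefault(case.theme, {})[case.package] = gain
    return qrels


# Illustrative labels only.
cases = [
    PackageRelevanceCaseSketch("graphs", "networkx", "core"),
    PackageRelevanceCaseSketch("graphs", "graphviz", "strong"),
    PackageRelevanceCaseSketch("graphs", "chromadb", "none"),
    PackageRelevanceCaseSketch("embeddings", "chromadb", "core"),
]
qrels = to_graded_qrels_sketch(cases)
print(qrels)
```

Freezing the dataclass keeps cases hashable and safe to share between runs, which suits a scoreboard that several experiments read from.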

Validation

Run end-to-end against the real frozen baseline (all-MiniLM-L6-v2 / hybrid, stored privately): it reproduces the failure numerically — graphs hard_positive_recall@20 = 0.00, embeddings 0.40, distractor fp_rate@20 0.22/0.14 — while graded nDCG@20 stays high (0.74/0.80), demonstrating that aggregate nDCG hides the two failure modes.
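The two named-set numbers quoted above can be read through a sketch like the following. The real signatures and exact definitions are not shown in this PR text, so this assumes that the false-positive rate is the share of named distractors appearing in the top k, and that recall is the share of named hard positives appearing in the top k. The `*_sketch` names and the sample ids are hypothetical.

```python
def fp_rate_on_distractors_sketch(ranked_ids, distractors, k):
    """Share of named distractors that show up in the top-k, plus the ids seen."""
    seen = sorted(set(ranked_ids[:k]) & set(distractors))
    rate = len(seen) / len(distractors) if distractors else 0.0
    return rate, seen


def recall_on_hard_positives_sketch(ranked_ids, hard_positives, k):
    """Share of named hard positives that show up in the top-k, plus the ids seen."""
    seen = sorted(set(ranked_ids[:k]) & set(hard_positives))
    rate = len(seen) / len(hard_positives) if hard_positives else 0.0
    return rate, seen


# Hypothetical ranking for one theme.
ranked = ["pkg_a", "distractor_1", "pkg_b", "hard_pos_1", "pkg_c"]
distractors = {"distractor_1", "distractor_2"}
hard_positives = {"hard_pos_1", "hard_pos_2"}

fp_at_3 = fp_rate_on_distractors_sketch(ranked, distractors, k=3)
recall_at_3 = recall_on_hard_positives_sketch(ranked, hard_positives, k=3)
recall_at_5 = recall_on_hard_positives_sketch(ranked, hard_positives, k=5)
print(fp_at_3, recall_at_3, recall_at_5)
```

Returning the ids alongside the rate mirrors the auditable report described earlier: a reader can check which packages produced the number.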

Privacy

The full 231-row labeling + frozen baseline live in a private repo (thorwhalen/ir-eval-data). Only a tiny public-name-only fixture (networkx, graphviz, chromadb, sentence-transformers, …) is committed under tests/fixtures/, exercising the schema, to_graded_qrels, round-trip, and the metric/gate functions offline in CI.
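As an illustration of what a public-name-only fixture and its JSONL round-trip might look like, here is a small sketch using only the standard library. The record fields, the `*_sketch` helper names, the file name, and the labels are assumptions; the actual save_package_cases / load_package_cases format is not shown here.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class CaseSketch:
    """Hypothetical fixture row: one graded label per line of JSONL."""

    theme: str
    package: str
    level: str


def save_cases_sketch(cases, path):
    """Write one JSON object per line, with stable key order for clean diffs."""
    with open(path, "w", encoding="utf-8") as f:
        for case in cases:
            f.write(json.dumps(asdict(case), sort_keys=True) + "\n")


def load_cases_sketch(path):
    """Read the JSONL file back into frozen case objects, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [CaseSketch(**json.loads(line)) for line in f if line.strip()]


# Illustrative labels over public package names only.
fixture = [
    CaseSketch("graphs", "networkx", "core"),
    CaseSketch("graphs", "graphviz", "strong"),
    CaseSketch("embeddings", "sentence-transformers", "core"),
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "package_cases.jsonl"
    save_cases_sketch(fixture, path)
    loaded = load_cases_sketch(path)
    line_count = len(path.read_text(encoding="utf-8").splitlines())

print(loaded == fixture, line_count)
```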

Tests

16 new tests in tests/test_package_relevance.py (hermetic, light embedder); full suite green (415 passed). Ruff E/W/F/B clean at line-length 88.
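The pytest gate that .regressions(threshold) is meant to drive could look roughly like this. It is a sketch under stated assumptions, not the ComparisonReport API: `rank_regressions_sketch` is a hypothetical stand-in, the threshold is assumed to be measured in rank positions, a missing id is treated as ranked just past the end of the list, and the rankings are made up.

```python
def rank_of(ranked_ids, item_id):
    """1-based rank of item_id; ids absent from the list rank just past its end."""
    try:
        return ranked_ids.index(item_id) + 1
    except ValueError:
        return len(ranked_ids) + 1


def rank_regressions_sketch(baseline, candidate, hard_positives, threshold=0):
    """Named hard positives whose rank worsened by more than `threshold` positions."""
    worse = []
    for pkg in sorted(hard_positives):
        delta = rank_of(candidate, pkg) - rank_of(baseline, pkg)
        if delta > threshold:
            worse.append(pkg)
    return worse


def test_no_hard_positive_regressions():
    """Gate: fail if the candidate indexing demotes any named hard positive."""
    baseline = ["networkx", "graphviz", "chromadb"]
    candidate = ["networkx", "chromadb", "graphviz"]
    assert rank_regressions_sketch(baseline, candidate, {"networkx"}) == []


# A candidate that pushes the named package from rank 1 to rank 3.
demoted = rank_regressions_sketch(
    ["networkx", "graphviz", "chromadb"],
    ["graphviz", "chromadb", "networkx"],
    {"networkx"},
)
print(demoted)
```

Listing the offending package names in the failure, rather than only a changed aggregate, keeps the gate actionable when an indexing change trades one named package for another.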

Closes #66. Part of #61.

https://claude.ai/code/session_01D229oNHVN1drd1mdbQL5MV

Additive, offline, model-free graded scoreboard for topical-relevance runs —
the shared measurement contract ef model bake-offs and the raglab labeling
recipe both score against (tracking #61).

- PackageRelevanceCase (frozen, JSONL round-trip) + RELEVANCE_LEVELS / LEVEL_GAINS
  (none/tangential=0, uses-tools=1, strong=2, core=3); save/load + read_package_meta.
- to_graded_qrels: graded gains reach ef.evaluation nDCG (vs to_qrels' flat 1).
- Named-set diagnostics: fp_rate_on_distractors / recall_on_hard_positives (pure),
  NamedSets + derive_named_sets, evaluate_named_sets -> auditable NamedSetReport.
- compare_indexings -> ComparisonReport: graded nDCG@k + named-set metrics +
  per-named-id rank/score deltas; .to_dict() JSON-clean; .regressions() drives a
  pytest gate. Embedder-agnostic (an ef instruction-tuned or ir deps-as-text
  corpus is just another entry).
- Public-name-only fixture + tests (16); full suite green (415).

Closes #66.

Claude-Session: https://claude.ai/code/session_01D229oNHVN1drd1mdbQL5MV
@thorwhalen thorwhalen merged commit 5401a2c into master Jun 18, 2026
12 checks passed
@thorwhalen thorwhalen deleted the claude/issue-66-graded-eval-harness branch June 18, 2026 13:18