Context
Thread: #61. ir.retrieve.search already has a rerank= hook (defaults None) that over-fetches fetch_k then delegates to ef.rerank via _apply_rerank (ir/retrieve.py:291,329), and ef.reranking already ships cross_encoder_reranker + the with_reranker over-fetch decorator (ef/reranking.py:129,177). But the two are not wired or calibrated on our data, so today's single-shot ir search ships the distractors near the top. Per the #38 decision rule this is legitimately ir's: a cross-encoder precision pass makes single-shot, agent-free search measurably better — it is NOT the agentic back-edge.
Problem (with our FP/FN evidence)
Many of our worst false positives are recall-stage artifacts that a precision re-scorer should catch: au (#2, 0.51) is an async-task framework whose single "DAG" bullet inflates its pooled-vector cosine; su/csm/voxy/theremin and creek/strand/reci score high on shared jargon, not topical aboutness. Conversely, a cross-encoder reading the full README can rescue the cores imbed/meshed/linked, which sank because their descriptions are thin.
Proposal
In ir — document and test the rerank= path end-to-end on the Package strategy: ensure ir.retrieve.search(..., mode='hybrid', rerank=<ef cross-encoder>) over-fetches fetch_k then re-scores, and that the cross-encoder sees a topical pair built from the package's full surface text (name + description + readme_chunk), not just the thin description. Keep ef.reranking.cross_encoder_reranker as the implementation — no new ir reranker class (ir composes ef). Add an ir-side knob to choose the (query, passage) text the reranker is fed (default: the best surface; option: concatenated sibling surfaces via records_for_artifact). Reranker stays opt-in / offline-default so import ir needs no torch.
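The over-fetch-then-rescore flow and the proposed passage-text knob can be sketched as below. This is a minimal illustration under stated assumptions, not the ir or ef implementation: the record shape, the best_surface field, passage_text, overfetch_then_rerank, and the toy scorer are all hypothetical names, and the package names in the toy corpus are made up.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Record = Dict[str, str]  # hypothetical record shape, not the real ir/ef type
SURFACES = ("name", "description", "readme_chunk")


def passage_text(record: Record, surface: str = "best") -> str:
    """Return the passage half of the (query, passage) pair."""
    if surface == "best":
        # Assumes the recall stage recorded which surface matched best.
        return record[record["best_surface"]]
    if surface == "concat":
        return "\n".join(record[f] for f in SURFACES if record.get(f))
    raise ValueError(f"unknown surface option: {surface!r}")


def overfetch_then_rerank(
    query: str,
    retrieve: Callable[[str, int], List[Record]],
    score_pairs: Callable[[Sequence[Tuple[str, str]]], Sequence[float]],
    *,
    k: int = 20,
    fetch_k: int = 50,
    surface: str = "best",
) -> List[Record]:
    """Fetch fetch_k candidates, re-score each (query, passage) pair, keep the top k."""
    candidates = retrieve(query, fetch_k)
    pairs = [(query, passage_text(c, surface)) for c in candidates]
    scores = score_pairs(pairs)
    # sorted() is stable, so tied scores keep their recall-stage order.
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order[:k]]


def toy_overlap_scorer(pairs):
    """Stand-in for a cross-encoder: fraction of query tokens found in the passage."""
    out = []
    for query, passage in pairs:
        q = query.lower().split()
        p = set(passage.lower().split())
        out.append(sum(t in p for t in q) / len(q))
    return out


# Made-up two-package corpus: a jargon-matching distractor and a thin-description core.
CORPUS = [
    {"name": "pkg_distractor", "description": "async task framework with dag support",
     "readme_chunk": "queues workers retries", "best_surface": "description"},
    {"name": "pkg_core", "description": "link functions",
     "readme_chunk": "compose functions into a dag and run the graph",
     "best_surface": "description"},
]


def toy_retrieve(query: str, fetch_k: int) -> List[Record]:
    return CORPUS[:fetch_k]


best = overfetch_then_rerank("dag graph", toy_retrieve, toy_overlap_scorer, k=2, surface="best")
concat = overfetch_then_rerank("dag graph", toy_retrieve, toy_overlap_scorer, k=2, surface="concat")
print([r["name"] for r in best])
print([r["name"] for r in concat])
```

In a real run, score_pairs would be backed by ef.reranking.cross_encoder_reranker; its signature is not shown in this thread, so the adapter is left open here. Taking the scorer as a plain callable keeps the sketch free of any torch import, in line with the opt-in constraint above.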
Experiment
The labeled set is the 231-package benchmark in the private, access-controlled repo thorwhalen/ir-eval-data: package_relevance_labels.jsonl (full 231-package graded gold labeling), named_sets.json (per-theme distractors + hard_positives), and benchmark_analysis.json (frozen all-MiniLM-L6-v2 baseline: precision@K = recall@K ≈ 0.42 for both themes). Clone with repo access; package names are not mirrored into this public repo. On that set, compare three single-shot configs built on the all-MiniLM hybrid retriever: (A) hybrid, no rerank (the baseline); (B) hybrid + cross-encoder/ms-marco-MiniLM-L-6-v2 rerank over best-surface; (C) hybrid + cross-encoder rerank over concatenated surfaces (full README). Run both theme probes. Track whether au/su/csm/creek drop out of the top 20 and whether the thin-description cores imbed/meshed/linked survive: (B) should not push them down; (C) should pull them up. Report the latency delta (fetch_k=50 → rerank) to confirm it stays single-shot-cheap.
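Tracking which named packages enter or leave the top 20 across configs (A), (B), and (C) could look like the sketch below. The helper names rank_of and movement_report are hypothetical, and the rankings are made-up placeholders rather than benchmark output.

```python
from typing import Dict, Optional, Sequence


def rank_of(ranking: Sequence[str], name: str) -> Optional[int]:
    """1-based rank of name in ranking, or None if it was not retrieved."""
    try:
        return ranking.index(name) + 1
    except ValueError:
        return None


def movement_report(
    rankings: Dict[str, Sequence[str]],
    tracked: Sequence[str],
    k: int = 20,
) -> Dict[str, Dict[str, dict]]:
    """For each tracked name, record its rank and top-k membership per config."""
    report: Dict[str, Dict[str, dict]] = {}
    for name in tracked:
        report[name] = {}
        for label, ranking in rankings.items():
            r = rank_of(ranking, name)
            report[name][label] = {"rank": r, "in_top_k": r is not None and r <= k}
    return report


# Placeholder rankings with k=2 so the toy example stays readable.
toy_rankings = {
    "A": ["pkg_distractor", "pkg_other", "pkg_core"],
    "C": ["pkg_core", "pkg_other", "pkg_distractor"],
}
toy_report = movement_report(toy_rankings, ["pkg_distractor", "pkg_core"], k=2)
for name, per_config in toy_report.items():
    print(name, per_config)
```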
Success metric
- nDCG@20 + P@10 on graded labels.
- Named-distractor FP-rate@20 on {au, strand, reci, creek, mv} (graphs) and {su, csm, voxy, theremin, haggle, farepy} (embeddings).
- Hard-positive recall@20 on {imbed, meshed, linked, chromadol, kroki, xcosmo, lexis}.
- Success = (C) cuts named-distractor FP-rate@20 by ≥40% vs (A) while holding hard-positive recall@20 ≥ (A). Report latency.
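The metrics and the success rule above can be written down as a short sketch. Several choices here are assumptions, not taken from the issue: nDCG uses linear gain with a log2 discount, P@K treats any grade of at least 1 as relevant, and both named-set metrics are computed as the fraction of the named set that appears in the top K. All function names are hypothetical.

```python
import math
from typing import Mapping, Sequence, Set


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranking: Sequence[str], grades: Mapping[str, float], k: int = 20) -> float:
    """nDCG@k with linear gain (an assumption; exponential gain is also common)."""
    gains = [grades.get(p, 0) for p in ranking[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def precision_at_k(ranking: Sequence[str], grades: Mapping[str, float],
                   k: int = 10, min_grade: float = 1) -> float:
    """Share of the top k whose grade is at least min_grade."""
    return sum(grades.get(p, 0) >= min_grade for p in ranking[:k]) / k


def named_rate_at_k(ranking: Sequence[str], named: Set[str], k: int = 20) -> float:
    """Fraction of a named set found in the top k.

    Applied to distractors this is the FP-rate@k; applied to hard positives
    it is the recall@k.
    """
    return len(set(ranking[:k]) & named) / len(named)


def meets_success(fp_a: float, fp_c: float, recall_a: float, recall_c: float,
                  min_cut: float = 0.40) -> bool:
    """(C) must cut FP-rate by at least min_cut relative to (A) without losing recall."""
    if fp_a == 0:
        return fp_c == 0 and recall_c >= recall_a
    return (fp_a - fp_c) / fp_a >= min_cut and recall_c >= recall_a


# Made-up labels and ranking, only to exercise the functions.
toy_grades = {"a": 2, "c": 1}
toy_ranking = ["a", "b", "c"]
print(ndcg_at_k(toy_ranking, toy_grades, k=3))
print(precision_at_k(toy_ranking, toy_grades, k=3))
print(named_rate_at_k(toy_ranking, {"b", "z"}, k=3))
print(meets_success(fp_a=0.5, fp_c=0.25, recall_a=0.6, recall_c=0.6))
```

The relative-cut reading of "cuts FP-rate@20 by ≥40%" is itself an interpretation; an absolute 40-point drop would need a different comparison in meets_success.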
Data
The full 231 graded cases plus the named-distractor and thin-description-core lists come from the private, access-controlled benchmark repo thorwhalen/ir-eval-data: package_relevance_labels.jsonl (full 231-package graded gold labeling), named_sets.json (per-theme distractors + hard_positives), and benchmark_analysis.json (frozen all-MiniLM-L6-v2 baseline: precision@K = recall@K ≈ 0.42 for both themes). Clone with repo access; package names are not mirrored into this public repo. Retrieval runs against the existing Package corpus (no re-embed).
https://claude.ai/code/session_01D229oNHVN1drd1mdbQL5MV