Wire ef's cross-encoder reranker as a calibrated precision stage through ir's single-shot rerank hook (audio/DAG distractor suppression)

## Context
Thread: i2mint/ir#61. `ir.retrieve.search` already has a `rerank=` hook (defaults `None`) that over-fetches `fetch_k` then delegates to `ef.rerank` via `_apply_rerank` (`ir/retrieve.py:291,329`), and `ef.reranking` already ships `cross_encoder_reranker` + the `with_reranker` over-fetch decorator (`ef/reranking.py:129,177`). But the two are not wired or calibrated on our data, so today's single-shot `ir` search ships the distractors near the top. Per the #38 decision rule this is legitimately `ir`'s: a cross-encoder precision pass makes single-shot, agent-free search measurably better — it is NOT the agentic back-edge.

## Problem (with our FP/FN evidence)
Many of our worst false positives are recall-stage artifacts a precision re-scorer should catch: `au` (#2, 0.51) is an async-task framework whose single "DAG" bullet inflates its pooled-vector cosine; `su`/`csm`/`voxy`/`theremin` and `creek`/`strand`/`reci` score high on shared jargon, not topical aboutness. Conversely, a cross-encoder reading the **full README** can rescue the thin-description cores `imbed`/`meshed`/`linked` that sank on thin text.

## Proposal
In `ir` — document and test the `rerank=` path end-to-end on the `Package` strategy: ensure `ir.retrieve.search(..., mode='hybrid', rerank=<ef cross-encoder>)` over-fetches `fetch_k` then re-scores, and that the cross-encoder sees a topical pair built from the package's **full surface text** (name + description + readme_chunk), not just the thin description. Keep `ef.reranking.cross_encoder_reranker` as the implementation — **no new ir reranker class** (`ir` composes `ef`). Add an `ir`-side knob to choose the `(query, passage)` text the reranker is fed (default: the best surface; option: concatenated sibling surfaces via `records_for_artifact`). Reranker stays opt-in / offline-default so `import ir` needs no torch.
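The over-fetch-then-re-score flow and the passage-text knob described above can be sketched roughly as follows. This is a minimal illustration, not `ir`'s or `ef`'s actual API: `build_passage`, `overfetch_then_rerank`, the `best_surface` record key, and the toy records are all hypothetical names introduced here, and the word-overlap scorer only stands in for a real cross-encoder.

```python
from typing import Callable, Mapping, Sequence

# Assumption: each candidate record exposes these three surface texts.
SURFACE_FIELDS = ("name", "description", "readme_chunk")


def build_passage(record: Mapping[str, str], mode: str = "best_surface") -> str:
    """Hypothetical knob: choose the passage text fed to the reranker."""
    if mode == "best_surface":
        # Assumption: the record names which surface matched best at recall time.
        return record[record["best_surface"]]
    if mode == "concatenated":
        # Full surface text: name + description + readme_chunk (empty ones skipped).
        return "\n".join(record[f] for f in SURFACE_FIELDS if record.get(f))
    raise ValueError(f"unknown passage mode: {mode!r}")


def overfetch_then_rerank(
    query: str,
    recall_fn: Callable[[str, int], Sequence[Mapping[str, str]]],
    score_fn: Callable[[str, str], float],
    *,
    k: int = 10,
    fetch_k: int = 50,
    passage_mode: str = "concatenated",
) -> list:
    """Fetch fetch_k candidates, re-score each (query, passage) pair, keep top k."""
    candidates = recall_fn(query, fetch_k)
    scored = [(score_fn(query, build_passage(c, passage_mode)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): number of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


# Toy records; the recall stage ranks the distractor first.
toy_records = [
    {"name": "toy_tasks", "description": "async task framework",
     "readme_chunk": "schedules jobs as a DAG", "best_surface": "readme_chunk"},
    {"name": "toy_graphs", "description": "graph tools",
     "readme_chunk": "build and traverse DAG graph structures",
     "best_surface": "description"},
]


def toy_recall(query: str, fetch_k: int):
    return toy_records[:fetch_k]


top = overfetch_then_rerank("DAG graph", toy_recall, overlap_score, k=1, fetch_k=50)
```

In a real run, `score_fn` would presumably wrap the cross-encoder, for example `sentence_transformers.CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict([(query, passage)])`, imported lazily so that `import ir` still needs no torch.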

## Experiment
On the 231-package labeled set (from the **private** benchmark repo **`thorwhalen/ir-eval-data`**; see Data), compare three single-shot configs against the all-MiniLM hybrid baseline, on both theme probes:

- (A) hybrid, no rerank;
- (B) hybrid + `cross-encoder/ms-marco-MiniLM-L-6-v2` rerank over best-surface;
- (C) hybrid + cross-encoder rerank over concatenated surfaces (full README).

Track whether `au`/`su`/`csm`/`creek` drop out of the top-20 and whether the thin-description cores `imbed`/`meshed`/`linked` survive: (B) should not push them down; (C) should pull them up. Report the latency delta (`fetch_k=50` → rerank) to confirm it stays single-shot-cheap.
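One way to report the latency delta is a small timing harness like the one below. It is only a sketch: `median_latency_ms` and `latency_report` are hypothetical helper names, and each config is assumed to be wrapped as a callable that takes a query string and runs one single-shot search.

```python
import statistics
import time
from typing import Callable, Mapping, Sequence


def median_latency_ms(
    search_fn: Callable[[str], object], queries: Sequence[str], repeats: int = 5
) -> float:
    """Median wall-clock time per query, in milliseconds."""
    samples = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            search_fn(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def latency_report(
    configs: Mapping[str, Callable[[str], object]],
    queries: Sequence[str],
    baseline: str = "A",
    repeats: int = 5,
) -> dict:
    """Median latency per config plus its delta against the baseline config."""
    medians = {
        name: median_latency_ms(fn, queries, repeats) for name, fn in configs.items()
    }
    return {
        name: {"median_ms": ms, "delta_ms": ms - medians[baseline]}
        for name, ms in medians.items()
    }
```

Usage would look like `latency_report({"A": search_a, "B": search_b, "C": search_c}, theme_queries)`, where the three callables are whatever wrappers the experiment script defines. Cross-encoder model loading should happen before timing so the report reflects per-query cost only.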

## Success metric
- nDCG@20 + P@10 on graded labels.
- Named-distractor FP-rate@20 on `{au, strand, reci, creek, mv}` (graphs) and `{su, csm, voxy, theremin, haggle, farepy}` (embeddings).
- Hard-positive recall@20 on `{imbed, meshed, linked, chromadol, kroki, xcosmo, lexis}`.
- **Success = (C) cuts named-distractor FP-rate@20 by ≥40% vs (A) while holding hard-positive recall@20 ≥ (A).** Report latency.
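The metrics above could be computed along these lines. This sketch makes two assumptions that the issue does not spell out: FP-rate@k is taken as the fraction of the named distractors that appear in the top k, and nDCG uses linear gain on the graded labels. All function names are illustrative.

```python
import math
from typing import Mapping, Sequence, Set


def named_rate_at_k(ranked: Sequence[str], named: Set[str], k: int = 20) -> float:
    """Fraction of a named set found in the top k.

    Used as FP-rate@k for distractors and as recall@k for hard positives.
    """
    return len(set(ranked[:k]) & named) / len(named)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Share of the top k that is relevant."""
    return sum(1 for pkg in ranked[:k] if pkg in relevant) / k


def ndcg_at_k(ranked: Sequence[str], grades: Mapping[str, float], k: int = 20) -> float:
    """nDCG@k with linear gain (assumption); unlabeled packages count as grade 0."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    actual = dcg([grades.get(pkg, 0.0) for pkg in ranked[:k]])
    ideal = dcg(sorted(grades.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


def meets_success_criterion(
    ranked_a: Sequence[str],
    ranked_c: Sequence[str],
    distractors: Set[str],
    hard_positives: Set[str],
    k: int = 20,
    min_cut: float = 0.40,
) -> bool:
    """True if (C) cuts FP-rate@k by at least min_cut vs (A) without losing recall@k."""
    fp_a = named_rate_at_k(ranked_a, distractors, k)
    fp_c = named_rate_at_k(ranked_c, distractors, k)
    recall_a = named_rate_at_k(ranked_a, hard_positives, k)
    recall_c = named_rate_at_k(ranked_c, hard_positives, k)
    return fp_c <= (1.0 - min_cut) * fp_a and recall_c >= recall_a
```

If (A) already has an FP-rate of zero on a theme, the criterion reduces to "(C) also has zero distractors and no recall loss", which seems like the sensible reading of the ≥40% rule in that edge case.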

## Data
Full 231 graded cases plus the named-distractor and thin-description-core lists, from the **private**, access-controlled benchmark repo **`thorwhalen/ir-eval-data`**:

- `package_relevance_labels.jsonl`: full 231-package graded gold labeling.
- `named_sets.json`: per-theme `distractors` + `hard_positives`.
- `benchmark_analysis.json`: frozen `all-MiniLM-L6-v2` baseline (precision@K = recall@K ≈ 0.42 for both themes).

Clone with repo access; package names are not mirrored into this public repo. Retrieval runs over the existing Package corpus (no re-embed).

https://claude.ai/code/session_01D229oNHVN1drd1mdbQL5MV
