Wire ef's cross-encoder reranker as a calibrated precision stage through ir's single-shot rerank hook (audio/DAG distractor suppression) #65

Description

@thorwhalen

Context

Thread: #61. ir.retrieve.search already has a rerank= hook (defaults None) that over-fetches fetch_k then delegates to ef.rerank via _apply_rerank (ir/retrieve.py:291,329), and ef.reranking already ships cross_encoder_reranker + the with_reranker over-fetch decorator (ef/reranking.py:129,177). But the two are not wired or calibrated on our data, so today's single-shot ir search ships the distractors near the top. Per the #38 decision rule this is legitimately ir's: a cross-encoder precision pass makes single-shot, agent-free search measurably better — it is NOT the agentic back-edge.

Problem (with our FP/FN evidence)

Many of our worst false positives are recall-stage artifacts that a precision re-scorer should catch: au (#2, 0.51) is an async-task framework whose single "DAG" bullet inflates its pooled-vector cosine; su/csm/voxy/theremin and creek/strand/reci score high on shared jargon rather than topical aboutness. Conversely, a cross-encoder that reads the full README can rescue the cores imbed/meshed/linked, which sank because their descriptions are thin.

Proposal

In ir: document and test the rerank= path end-to-end on the Package strategy. Ensure that ir.retrieve.search(..., mode='hybrid', rerank=<ef cross-encoder>) over-fetches fetch_k candidates and then re-scores them, and that the cross-encoder sees a topical pair built from the package's full surface text (name + description + readme_chunk), not just the thin description. Keep ef.reranking.cross_encoder_reranker as the implementation, with no new ir reranker class (ir composes ef). Add an ir-side knob to choose the (query, passage) text the reranker is fed (default: the best surface; option: concatenated sibling surfaces via records_for_artifact). The reranker stays opt-in / offline-default so that import ir needs no torch.
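The over-fetch-then-re-score flow and the passage-text knob described above can be sketched generically. This is a minimal illustration, not ir's or ef's actual code: `overfetch_then_rerank`, `passage_text`, the `surface` argument, the `best_surface` key, and the toy scorer are all hypothetical names, and the records use placeholder package names and invented text.

```python
from typing import Callable, Sequence

# Hypothetical record shape: a dict with "name", "description",
# "readme_chunk", a recall-stage "score", and optionally "best_surface".
Record = dict
PairScorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]


def passage_text(record: Record, surface: str = "best") -> str:
    """Build the passage half of the (query, passage) pair."""
    if surface == "concat":
        # Concatenate sibling surfaces: name + description + readme_chunk.
        parts = (record.get(key, "") for key in ("name", "description", "readme_chunk"))
        return "\n".join(part for part in parts if part)
    # Assumed default: the surface that matched, else the description.
    return record.get("best_surface") or record.get("description", "")


def overfetch_then_rerank(query, retrieve, score_pairs: PairScorer,
                          *, k=20, fetch_k=50, surface="best"):
    """Fetch fetch_k recall-stage candidates, re-score them, keep the top k."""
    candidates = list(retrieve(query, fetch_k))
    pairs = [(query, passage_text(record, surface)) for record in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [dict(record, rerank_score=float(score)) for record, score in ranked[:k]]


def toy_overlap_scorer(pairs):
    """Stand-in for a cross-encoder: counts shared lowercase tokens."""
    return [len(set(q.lower().split()) & set(p.lower().split())) for q, p in pairs]


# Placeholder corpus (invented text, not taken from the benchmark).
records = [
    {"name": "pkg_a", "description": "async tasks with one dag bullet",
     "readme_chunk": "workers and queues", "score": 0.51},
    {"name": "pkg_b", "description": "compose functions",
     "readme_chunk": "build a dag graph of functions", "score": 0.30},
]


def retrieve(query, fetch_k):
    """Toy recall stage: order by the stored score and truncate to fetch_k."""
    return sorted(records, key=lambda r: r["score"], reverse=True)[:fetch_k]


best = overfetch_then_rerank("graph dag", retrieve, toy_overlap_scorer, k=2, surface="best")
concat = overfetch_then_rerank("graph dag", retrieve, toy_overlap_scorer, k=2, surface="concat")
print([r["name"] for r in best], [r["name"] for r in concat])
```

In a real run the toy scorer would be replaced by a cross-encoder; for instance, assuming sentence-transformers is installed, `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict` accepts a list of (query, passage) pairs and could be passed as `score_pairs`. Keeping the scorer injectable is one way to keep torch out of the import path.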

Experiment

The 231-package labeled set lives in the private, access-controlled benchmark repo thorwhalen/ir-eval-data:

  • package_relevance_labels.jsonl: full 231-package graded gold labeling.
  • named_sets.json: per-theme distractors + hard_positives.
  • benchmark_analysis.json: frozen all-MiniLM-L6-v2 baseline (precision@K = recall@K ≈ 0.42 for both themes).

Clone with repo access; package names are not mirrored into this public repo.

On this set, compare three single-shot configs against the all-MiniLM hybrid baseline, on both theme probes:

  • (A) hybrid, no rerank.
  • (B) hybrid + cross-encoder/ms-marco-MiniLM-L-6-v2 rerank over best-surface.
  • (C) hybrid + cross-encoder rerank over concatenated surfaces (full README).

Track whether au/su/csm/creek drop out of top-20 and whether the thin-description cores imbed/meshed/linked survive: (B) should not push them down; (C) should pull them up. Report the latency delta (fetch_k=50 → rerank) to confirm it stays single-shot-cheap.
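A small harness along these lines could run the three configs, record where the watched packages land, and report latency relative to (A). It is only a sketch: `run_config` and `compare_configs` are hypothetical helpers, each config is assumed to be a callable that takes a query and returns ranked package names, and the stub searches below return placeholder names rather than benchmark data.

```python
import time
from typing import Callable, Iterable


def run_config(search: Callable[[str], list], query: str,
               watch: Iterable[str], k: int = 20) -> dict:
    """Time one search call and record each watched name's rank in the top k."""
    start = time.perf_counter()
    ranked = search(query)
    elapsed = time.perf_counter() - start
    top_k = list(ranked[:k])
    # Rank is 1-based; None means the name fell outside the top k.
    ranks = {name: (top_k.index(name) + 1 if name in top_k else None) for name in watch}
    return {"latency_s": elapsed, "ranks": ranks}


def compare_configs(configs: dict, query: str, watch: Iterable[str],
                    k: int = 20, baseline: str = "A") -> dict:
    """Run every config on one query and add the latency delta vs the baseline."""
    watch = tuple(watch)  # allow reuse across configs even if given a generator
    results = {label: run_config(search, query, watch, k)
               for label, search in configs.items()}
    base_latency = results[baseline]["latency_s"]
    for result in results.values():
        result["latency_delta_s"] = result["latency_s"] - base_latency
    return results


# Stub searches standing in for configs (A) and (C); names are placeholders.
stub_configs = {
    "A": lambda query: ["distractor_1", "other", "core_1"],
    "C": lambda query: ["core_1", "other"],
}
report = compare_configs(stub_configs, "dag", watch=["distractor_1", "core_1"], k=2)
for label, result in report.items():
    print(label, result["ranks"])
```

A single timed call is noisy; repeating each query several times and reporting a median would give a steadier latency delta.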

Success metric

  • nDCG@20 + P@10 on graded labels.
  • Named-distractor FP-rate@20 on {au, strand, reci, creek, mv} (graphs) and {su, csm, voxy, theremin, haggle, farepy} (embeddings).
  • Hard-positive recall@20 on {imbed, meshed, linked, chromadol, kroki, xcosmo, lexis}.
  • Success = (C) cuts named-distractor FP-rate@20 by ≥40% vs (A) while holding hard-positive recall@20 ≥ (A). Report latency.
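The metrics above could be computed roughly as follows. The issue does not fix exact definitions, so this sketch makes assumptions that are flagged in the comments: nDCG uses linear gain with a log2 discount, P@K counts any grade of at least 1 as relevant, and FP-rate@20 is taken as the fraction of the named distractors that appear in the top 20. All function names are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gain (an assumption) and log2 discount."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains))


def ndcg_at_k(ranked, grades, k=20):
    """nDCG@k against graded labels; unlabeled names count as grade 0."""
    gains = [grades.get(name, 0) for name in ranked[:k]]
    ideal_dcg = dcg(sorted(grades.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def precision_at_k(ranked, grades, k=10, min_grade=1):
    """P@k, treating grade >= min_grade as relevant (threshold is an assumption)."""
    return sum(grades.get(name, 0) >= min_grade for name in ranked[:k]) / k


def distractor_fp_rate_at_k(ranked, distractors, k=20):
    """Assumed definition: share of the named distractors present in the top k."""
    distractors = set(distractors)
    return len(set(ranked[:k]) & distractors) / len(distractors)


def hard_positive_recall_at_k(ranked, hard_positives, k=20):
    """Share of the named hard positives present in the top k."""
    hard_positives = set(hard_positives)
    return len(set(ranked[:k]) & hard_positives) / len(hard_positives)


def meets_success(fp_a, fp_c, recall_a, recall_c, min_cut=0.40):
    """(C) must cut FP-rate by >= min_cut relative to (A) and not lose recall."""
    if fp_a == 0:
        # Relative cut is undefined with no baseline false positives;
        # this sketch returns False so the case is reviewed by hand.
        return False
    return (fp_a - fp_c) / fp_a >= min_cut and recall_c >= recall_a


# Placeholder example (not benchmark data).
grades = {"a": 2, "b": 0, "c": 1}
print(ndcg_at_k(["a", "b", "c"], grades, k=3))
print(meets_success(fp_a=0.8, fp_c=0.4, recall_a=0.5, recall_c=0.5))
```

If the graded labels are meant to reward top grades more steeply, an exponential gain (2^grade - 1) is a common alternative to the linear gain assumed here.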

Data

Full 231 graded cases plus the named-distractor and thin-description-core lists, all from the private, access-controlled benchmark repo thorwhalen/ir-eval-data (package_relevance_labels.jsonl, named_sets.json, and benchmark_analysis.json); the existing Package corpus (no re-embed).

https://claude.ai/code/session_01D229oNHVN1drd1mdbQL5MV

Metadata

Labels: enhancement (New feature or request)