Finding (dogfood F-017)
Surfaced while calibrating the retrieval-quality eval harness (#312, PR #321) against the OCI repo index over the 10-query ground-truth set.
Cohere reranking is net-negative on rank quality on this set:
| Tier | recall@10 | MRR |
| --- | --- | --- |
| Free (core ranker) | 0.80 | 0.80 |
| Pro (Cohere rerank) | 0.85 | 0.658 |
Reranking buys +0.05 recall@10 (pulls one extra expected file into the top 10) but drops MRR 0.80 -> 0.658 by demoting rank-1 hits.
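To make the trade concrete, here is a minimal sketch of the two metrics using their standard definitions. The eval harness may compute them differently; the function names and the single example query below are hypothetical and are not taken from the harness or the ground-truth set.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], expected: Set[str], k: int = 10) -> float:
    """Fraction of expected files that appear in the top k results."""
    if not expected:
        return 0.0
    return len(set(ranked[:k]) & expected) / len(expected)


def reciprocal_rank(ranked: Sequence[str], expected: Set[str]) -> float:
    """1 / rank of the first expected file, or 0.0 if none is retrieved."""
    for rank, path in enumerate(ranked, start=1):
        if path in expected:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[str]], truths: Sequence[Set[str]]) -> float:
    """Average reciprocal rank over all queries (MRR)."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, t) for r, t in zip(runs, truths)) / len(runs)


# Hypothetical query with two expected files (illustrative data only).
expected = {"a.py", "b.py"}
core = ["a.py", "x.py", "y.py"]      # a.py at rank 1, b.py not retrieved
reranked = ["x.py", "b.py", "a.py"]  # both retrieved, but first hit is at rank 2

print(recall_at_k(core, expected), reciprocal_rank(core, expected))
print(recall_at_k(reranked, expected), reciprocal_rank(reranked, expected))
```

In this toy case reranking raises recall for the query while halving its reciprocal rank, which is the same shape of trade as in the table above: a single rank-1 hit demoted to rank 2 costs 0.5 of reciprocal rank on that query.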
Why this matters
For an agent consumer, the rank-1 hit carries most of the value: it reads top results first and pays per token, so a strong top hit is worth more than marginally better deep-list coverage. Trading MRR for recall is the wrong trade here, yet reranking is the pro-tier default. A paying user can get worse top-rank quality than the free tier.
Suggested investigation
- Inspect the per-query breakdown (eval `results/` JSON) to identify exactly which queries the reranker demotes.
- Options to evaluate:
  - Only rerank when the core ranker's top hit is low-confidence (conditional rerank).
  - Blend the rerank score with the original rank instead of a full reorder.
  - Reconsider whether Cohere rerank should be the pro-tier default on small codebases at all.
- Re-run `python -m evals` after any change and compare against the calibrated baseline (free recall@10 0.80 / MRR 0.80).
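The first two options could be prototyped roughly as follows. This is a sketch under stated assumptions, not the project's implementation: `conditional_rerank`, `blend`, the `min_confidence` threshold, the `weight` parameter, and the `rerank` callable are all hypothetical names, and the rerank scores are assumed to be normalized to [0, 1].

```python
from typing import Callable, List, Mapping, Sequence, Tuple

Scored = Tuple[str, float]  # (file path, core ranker score); higher is better


def conditional_rerank(
    core_hits: Sequence[Scored],
    rerank: Callable[[List[str]], List[str]],  # hypothetical wrapper around the reranker
    min_confidence: float = 0.5,               # assumed threshold, to be tuned via evals
) -> List[str]:
    """Keep the core order when its top hit is confident; otherwise rerank."""
    paths = [path for path, _ in core_hits]
    if core_hits and core_hits[0][1] >= min_confidence:
        return paths
    return rerank(paths)


def blend(
    core_order: Sequence[str],
    rerank_scores: Mapping[str, float],  # assumed to lie in [0, 1]
    weight: float = 0.5,                 # 0.0 = core order only, 1.0 = full reorder
) -> List[str]:
    """Order by a weighted mix of rerank score and reciprocal original rank."""
    def blended(item: Tuple[int, str]) -> float:
        position, path = item
        prior = 1.0 / (position + 1)  # reciprocal of the original 1-based rank
        return weight * rerank_scores.get(path, 0.0) + (1.0 - weight) * prior

    ranked = sorted(enumerate(core_order), key=blended, reverse=True)
    return [path for _, path in ranked]


# Illustrative data only: the reranker prefers b.py, the core ranker prefers a.py.
core_order = ["a.py", "b.py", "c.py"]
rerank_scores = {"a.py": 0.6, "b.py": 0.9, "c.py": 0.2}

print(blend(core_order, rerank_scores, weight=0.5))
print(blend(core_order, rerank_scores, weight=1.0))
```

With `weight=0.5` the original rank-1 hit stays on top in this example, while `weight=1.0` reproduces the full reorder. Either variant would still need to be validated against the calibrated baseline before changing the default.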
Notes