Commit 022ef14

Merge pull request #318 from DevanshuNEU/feat/retrieval-eval-harness-v0.1
feat: retrieval-quality eval harness (recall@10, MRR) for the live search_v2 ranker
2 parents ca451f0 + db87f0a commit 022ef14

10 files changed

Lines changed: 463 additions & 0 deletions

backend/evals/README.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# Retrieval-quality eval harness (#312)

Offline, deterministic measurement of OCI's live `search_v2` ranker. Answers one
question with a number: **for a set of known queries, does search return the right
files, ranked high enough to matter?** Metrics: `recall@5`, `recall@10`, `precision@k`,
`MRR`. This is the internal regression instrument; it is not the public benchmark.
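
To make the metrics concrete, here is a hand-worked sketch for a single query. It is for intuition only: the harness itself delegates the math to ranx, and the helper names `recall_at_k` and `reciprocal_rank`, as well as the sample file names, are illustrative assumptions rather than part of the harness.

```python
from typing import List, Set


def recall_at_k(ranked: List[str], expected: Set[str], k: int) -> float:
    """Fraction of the expected files that appear in the top-k results."""
    if not expected:
        return 0.0
    return len(set(ranked[:k]) & expected) / len(expected)


def reciprocal_rank(ranked: List[str], expected: Set[str]) -> float:
    """1 / rank of the first expected file, or 0.0 if none was returned."""
    for rank, path in enumerate(ranked, start=1):
        if path in expected:
            return 1.0 / rank
    return 0.0


# Hypothetical result list for one query with two expected files.
ranked = ["a.py", "b.py", "c.py", "d.py"]
expected = {"b.py", "z.py"}

print(recall_at_k(ranked, expected, k=10))  # one of the two expected files was found
print(reciprocal_rank(ranked, expected))    # the first hit sits at rank 2
```

MRR is the mean of the reciprocal rank over all scored queries, and `recall@k` is likewise averaged per query.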

## Why it exists

You cannot improve, defend, or sell what you cannot measure. Before this, search
quality was unquantified. This harness is the speedometer: run it before and after
any change to search (a new reranker, an embedding-model swap, the v2 to v3 cutover)
and see whether the change helped or hurt.

## Tiers (Cohere reranking is pro-only)

Cohere costs money, so reranking is a **pro-tier** feature. The harness records both:

- **Free tier** (`--free-only`): no reranking. The deterministic BM25 + vector core
  ranker -- what most users and agents actually get. This is the CI/regression baseline.
- **Pro tier** (`--pro-only`): Cohere reranking. Requires `COHERE_API_KEY`; without it
  the pro run is skipped (loudly), never silently duplicated.

Default (`python -m evals`) runs both and prints the delta, so you can see exactly
what reranking is worth.
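
The skip-loud rule for the pro tier can be sketched as below. This is a minimal illustration, not the harness's runner code: the function name `require_cohere_key` and the error message are assumptions. Only the `COHERE_API_KEY` variable name and the `RuntimeError` that the CLI catches come from this change.

```python
import os
from typing import Mapping, Optional


def require_cohere_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Cohere key, or raise so the caller can skip the pro run loudly."""
    env = os.environ if env is None else env
    key = env.get("COHERE_API_KEY", "").strip()
    if not key:
        # Raising (rather than falling back to the free ranker) prevents a
        # "pro" result that is silently a duplicate of the free-tier run.
        raise RuntimeError("COHERE_API_KEY is not set; pro-tier run skipped")
    return key


try:
    require_cohere_key({})  # an empty mapping simulates a missing key
except RuntimeError as exc:
    print(f"[PRO TIER SKIPPED] {exc}")
```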

## Run it

From `backend/` (needs `OPENAI_API_KEY`, `PINECONE_API_KEY`, and a populated index):

```bash
pip install -r requirements-dev.txt             # one-time: installs ranx
export OCI_EVAL_REPO_ID=<your-indexed-repo-id>  # defaults to OCI's own repo id
python -m evals                                 # both tiers
python -m evals --free-only                     # free tier only (deterministic, no Cohere)
```

Each run prints a table and writes a timestamped JSON to `results/` (git-ignored) with
a **per-query breakdown** -- the rank of every expected file -- so a regression is
diagnosable, not just detectable.
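
A per-query breakdown of this kind could be computed as follows. The helper name `rank_breakdown` and the path `other/file.py` are hypothetical, and the actual JSON schema written to `results/` may differ; this sketch only shows the idea of recording a rank for every expected file.

```python
from typing import Dict, List, Optional


def rank_breakdown(
    expected_files: List[str], returned_files: List[str]
) -> Dict[str, Optional[int]]:
    """Map each expected file to its 1-based rank in the results, or None if absent."""
    first_rank: Dict[str, int] = {}
    for rank, path in enumerate(returned_files, start=1):
        first_rank.setdefault(path, rank)  # keep the earliest rank if a path repeats
    return {path: first_rank.get(path) for path in expected_files}


breakdown = rank_breakdown(
    expected_files=[
        "backend/services/repo_manager.py",
        "backend/services/supabase_service.py",
    ],
    returned_files=["other/file.py", "backend/services/repo_manager.py"],
)
print(breakdown)
```

A `None` rank pinpoints exactly which expected file a change pushed out of the result list.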

## First run is a calibration step

`expected_files` are repo-root-relative and must match the `file_path` strings the
index stores. If the first run shows recall near 0 across *all* queries, that is almost
certainly a path-format mismatch, not a bad ranker -- compare `expected_files` against
`returned_files` in the per-query breakdown and adjust the labels. Then set the baseline.
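
One way to spot a path-format mismatch quickly is to compare labels and results after stripping common prefixes. The sketch below assumes the mismatch takes the form of a leading `./`, a leading `/`, or backslash separators; other forms are possible. `normalize` and `looks_like_path_mismatch` are hypothetical helpers, not part of the harness.

```python
from typing import List


def normalize(path: str) -> str:
    """Canonicalize separators and drop a leading './' or '/' (for comparison only)."""
    path = path.replace("\\", "/")
    while path.startswith("./"):
        path = path[2:]
    return path.lstrip("/")


def looks_like_path_mismatch(expected_files: List[str], returned_files: List[str]) -> bool:
    """True when nothing matches exactly but something matches after normalization."""
    exact = set(expected_files) & set(returned_files)
    loose = {normalize(p) for p in expected_files} & {normalize(p) for p in returned_files}
    return not exact and bool(loose)


print(looks_like_path_mismatch(
    ["backend/middleware/auth.py"], ["./backend/middleware/auth.py"]
))
```

If this returns `True` for most queries, fix the labels (or the normalization at index time) before recording a baseline.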

## Add a query (the rule that keeps the number honest)

Edit `ground_truth/queries.json`. Two non-negotiables:

1. **Write it agent-shaped.** Phrase it the way an agent asks (`"where is the JWT
   validated before a request reaches a route"`), not keyword-shaped (`"auth"`).
2. **Label it blind.** Decide `expected_files` by *reading the repo*, never by looking
   at what search returns. Grading the test against the system's own output measures
   self-agreement, not correctness (FM-2 in the ADR).
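
Some of these rules can be checked mechanically before a run. The validator below is a sketch under stated assumptions: `validate_queries` is a hypothetical helper, and the four-word minimum is an arbitrary stand-in for "agent-shaped". Blind labeling cannot be verified by code and stays a human discipline.

```python
from typing import Any, Dict, List


def validate_queries(queries: List[Dict[str, Any]], min_words: int = 4) -> List[str]:
    """Return human-readable problems; an empty list means the set looks sane."""
    problems: List[str] = []
    seen_ids = set()
    for entry in queries:
        qid = entry.get("query_id", "<missing id>")
        if qid in seen_ids:
            problems.append(f"{qid}: duplicate query_id")
        seen_ids.add(qid)
        # Crude guard against keyword-shaped queries such as "auth".
        if len(str(entry.get("query", "")).split()) < min_words:
            problems.append(f"{qid}: query looks keyword-shaped")
        files = entry.get("expected_files") or []
        if not files:
            problems.append(f"{qid}: expected_files is empty")
        for path in files:
            if path.startswith(("/", "./")):
                problems.append(f"{qid}: {path} is not repo-root-relative")
    return problems


print(validate_queries([{"query_id": "q99", "query": "auth", "expected_files": []}]))
```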

## Calibrate the regression gate

`backend/evals/test_retrieval_quality.py` is a pytest gate that asserts
`recall@10 >= baseline - tolerance`. It lives here (not in `backend/tests/`) because
that suite globally mocks Pinecone + OpenAI, which would force recall to 0. Run it with
`pytest evals/ -v`. It skips until `baseline.json` has `calibrated: true`:

1. Run `python -m evals`, confirm the numbers are real (calibration above).
2. Put the free-tier `recall@10` and `mrr` into `baseline.json`, set `calibrated: true`.
3. From then on, `pytest evals/ -v` fails any change that regresses recall beyond tolerance.
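
The gate's comparison reduces to a single inequality. The sketch below is not the contents of `test_retrieval_quality.py`: the helper name `gate_status`, its return strings, and the sample numbers are assumptions, while the `calibrated`, `tolerance`, and `free_core` keys follow `baseline.json`.

```python
from typing import Any, Dict


def gate_status(baseline: Dict[str, Any], measured_recall_at_10: float) -> str:
    """Return 'skip', 'pass', or 'fail' for a measured free-tier recall@10."""
    if not baseline.get("calibrated"):
        return "skip"  # never assert against a placeholder baseline
    floor = baseline["free_core"]["recall@10"] - baseline["tolerance"]
    return "pass" if measured_recall_at_10 >= floor else "fail"


# Hypothetical calibrated baseline; real values come from a first run.
baseline = {
    "calibrated": True,
    "tolerance": 0.05,
    "free_core": {"recall@10": 0.80, "mrr": 0.60},
}
print(gate_status(baseline, 0.76))  # above the 0.75 floor
print(gate_status(baseline, 0.70))  # below the 0.75 floor
```

In a pytest file, the `"skip"` branch would typically map to `pytest.skip(...)` and the other two to an assertion.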

## Known limitations (v0.1)

- **Local, not network-free CI.** Runs against the live index, so it needs real creds.
  Committing query embeddings + an index snapshot for network-free CI is a follow-up.
- **Pre-flight checks "repo has vectors," not SHA-equality.** `repo_sha` is recorded as
  documented intent; strict snapshot-pinning is a follow-up.
- **Empty results are flagged, not perfectly classified.** `search_v2` swallows errors
  and returns `[]`, so a true no-hit and a swallowed error look the same here; both are
  counted as misses and surfaced in `empty_query_ids` for investigation (FM-3).
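
The "flag, then count as a miss" handling of empty results can be illustrated in plain Python. This is a sketch under assumptions: `score_with_empties` is a hypothetical helper, it assumes every labeled query has at least one expected file, and the harness's own bookkeeping (which feeds ranx) may be structured differently.

```python
from typing import Dict, List, Set, Tuple


def score_with_empties(
    results: Dict[str, List[str]], expected: Dict[str, Set[str]], k: int = 10
) -> Tuple[float, List[str]]:
    """Mean recall@k over ALL labeled queries; empty result lists score 0 and are flagged."""
    empty_query_ids: List[str] = []
    total = 0.0
    for qid, wanted in expected.items():
        returned = results.get(qid, [])
        if not returned:
            # A true no-hit and a swallowed error are indistinguishable here.
            empty_query_ids.append(qid)
            continue  # adds 0 to the sum but still counts in the denominator
        total += len(set(returned[:k]) & wanted) / len(wanted)
    return total / len(expected), empty_query_ids


mean_recall, empty_ids = score_with_empties(
    results={"q01": ["a.py"], "q02": []},
    expected={"q01": {"a.py"}, "q02": {"b.py"}},
)
print(mean_recall, empty_ids)
```

Keeping the empty queries in the denominator means a swallowed error lowers the score instead of quietly shrinking the query set.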

backend/evals/__init__.py

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
"""Retrieval-quality eval harness (OCI #312).
2+
3+
Offline, deterministic measurement of the live search_v2 ranker: recall@k, MRR,
4+
precision@k over a human-labeled query set. Run via `python -m evals` from backend/.
5+
6+
Import isolation: this package and its deps (ranx) are imported only when an eval
7+
runs, never on the backend startup path. See oci known-bug #3.
8+
"""

backend/evals/__main__.py

Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,63 @@
1+
"""CLI entry for the retrieval-quality eval (OCI #312).
2+
3+
Usage (from backend/):
4+
python -m evals # record BOTH numbers: free tier + pro tier
5+
python -m evals --free-only # free tier only (no Cohere, deterministic)
6+
python -m evals --pro-only # pro tier only (Cohere rerank; needs COHERE_API_KEY)
7+
"""
8+
import argparse
9+
import asyncio
10+
from typing import Any, Dict
11+
12+
from .runner import run_eval
13+
14+
15+
# out is the heterogeneous result dict from run_eval; Any is intentional
16+
def _print_table(label: str, out: Dict[str, Any]) -> None:
17+
print(f"\n=== {label} (tier={out['tier']}, reranking={out['reranking']}) ===")
18+
print(
19+
f"queries scored: {out['n_queries_scored']} "
20+
f"empty/ambiguous: {out['n_empty_results']} "
21+
f"skipped repos: {len(out['skipped_repos'])}"
22+
)
23+
if not out["metrics"]:
24+
print(" no metrics: no queries scored (index not populated?). See SKIP-LOUD lines above.")
25+
return
26+
for name, val in out["metrics"].items():
27+
print(f" {name:<14} {val:.4f}")
28+
if out["empty_query_ids"]:
29+
print(
30+
f" NOTE: empty results for queries {out['empty_query_ids']} "
31+
f"(counted as misses; investigate index/errors before trusting these numbers)."
32+
)
33+
34+
35+
def main() -> None:
36+
ap = argparse.ArgumentParser(prog="python -m evals", description="OCI retrieval-quality eval (#312)")
37+
g = ap.add_mutually_exclusive_group()
38+
g.add_argument("--free-only", action="store_true", help="free-tier only (no reranking)")
39+
g.add_argument("--pro-only", action="store_true", help="pro-tier only (Cohere reranking)")
40+
args = ap.parse_args()
41+
42+
if args.free_only:
43+
_print_table("FREE TIER", asyncio.run(run_eval(reranking=False)))
44+
return
45+
if args.pro_only:
46+
_print_table("PRO TIER", asyncio.run(run_eval(reranking=True)))
47+
return
48+
49+
# default: record both numbers so we can see exactly what reranking is worth
50+
free = asyncio.run(run_eval(reranking=False))
51+
_print_table("FREE TIER (core ranker)", free)
52+
try:
53+
pro = asyncio.run(run_eval(reranking=True))
54+
_print_table("PRO TIER (Cohere rerank)", pro)
55+
f10, p10 = free["metrics"].get("recall@10"), pro["metrics"].get("recall@10")
56+
if f10 is not None and p10 is not None:
57+
print(f"\nReranking delta recall@10: {p10 - f10:+.4f} (what the pro tier buys)")
58+
except RuntimeError as e:
59+
print(f"\n[PRO TIER SKIPPED] {e}")
60+
61+
62+
if __name__ == "__main__":
63+
main()

backend/evals/baseline.json

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
{
2+
"calibrated": false,
3+
"tolerance": 0.05,
4+
"free_core": {
5+
"recall@10": null,
6+
"mrr": null
7+
},
8+
"pro_reranked": {
9+
"recall@10": null,
10+
"mrr": null
11+
},
12+
"note": "Set calibrated=true and fill the numbers after the first real `python -m evals` run against a populated index. Until calibrated, the pytest gate skips rather than asserting against a fake baseline."
13+
}

backend/evals/ground_truth/queries.json

Lines changed: 65 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,65 @@
1+
{
2+
"_about": "Human-labeled retrieval ground truth for OCI #312. Queries are agent-shaped (the way Claude Code phrases a search_code call), NOT keyword-shaped. expected_files are labeled BLIND -- decided by reading the repo source, never by looking at what search returns (see FM-2 in the ADR). Paths are repo-root-relative and must match the file_path strings the index stores; the first run's per-query breakdown is the calibration step. repo_id defaults to OCI's own indexed repo (env OCI_EVAL_REPO_ID overrides per run).",
3+
"queries": [
4+
{
5+
"query_id": "q01",
6+
"query": "where is the JWT validated before a request reaches a route",
7+
"expected_files": ["backend/middleware/auth.py"],
8+
"notes": "auth middleware; JWT local decode + API-key fallback"
9+
},
10+
{
11+
"query_id": "q02",
12+
"query": "how does hybrid search combine BM25 keyword scores with semantic vector scores",
13+
"expected_files": ["backend/services/search_v2/hybrid_searcher.py"],
14+
"notes": "RRF fusion of BM25 + semantic candidates"
15+
},
16+
{
17+
"query_id": "q03",
18+
"query": "where is Cohere reranking applied to the search candidates",
19+
"expected_files": ["backend/services/search_v2/hybrid_searcher.py"],
20+
"notes": "rerank step gated on COHERE_API_KEY (pro tier)"
21+
},
22+
{
23+
"query_id": "q04",
24+
"query": "how are code embeddings created in batches for indexing",
25+
"expected_files": ["backend/services/indexer_optimized.py"],
26+
"notes": "_create_embeddings_batch, EMBEDDING_BATCH_SIZE"
27+
},
28+
{
29+
"query_id": "q05",
30+
"query": "which API route handles the version 2 search request",
31+
"expected_files": ["backend/routes/search_v2.py"],
32+
"notes": "POST /search/v2 route; the path the MCP search_code tool hits"
33+
},
34+
{
35+
"query_id": "q06",
36+
"query": "how is repository cloning made durable and recovered when an indexing job gets stuck",
37+
"expected_files": ["backend/services/repo_manager.py", "backend/services/supabase_service.py"],
38+
"notes": "ensure_clone chokepoint + reset_stuck_indexing_jobs (PR #316)"
39+
},
40+
{
41+
"query_id": "q07",
42+
"query": "where are import dependency graphs extracted from the parsed AST",
43+
"expected_files": ["backend/services/dependency_analyzer.py"],
44+
"notes": "tree-sitter import graph extraction, singleton service"
45+
},
46+
{
47+
"query_id": "q08",
48+
"query": "how does the MCP server forward a search_code tool call to the backend API",
49+
"expected_files": ["mcp-server/handlers.py"],
50+
"notes": "_handle_search posts to /search/v2"
51+
},
52+
{
53+
"query_id": "q09",
54+
"query": "where is the codebase DNA and architectural pattern detection implemented",
55+
"expected_files": ["backend/services/dna_extractor.py"],
56+
"notes": "DNAExtractor.extract, team-rules detection"
57+
},
58+
{
59+
"query_id": "q10",
60+
"query": "how does file path filtering avoid matching adjacent sibling directories",
61+
"expected_files": ["backend/services/indexer_optimized.py"],
62+
"notes": "_discover_code_files include_paths; Path.parts not str.startswith (known-bug #5)"
63+
}
64+
]
65+
}

backend/evals/metrics.py

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
"""Retrieval metrics, computed by ranx.
2+
3+
ranx is imported lazily inside the function: it is a heavy optional dep and must
4+
never be on the backend startup import path (oci known-bug #3). We buy the metric
5+
math instead of hand-rolling it -- recall@k / MRR are a classic subtle-bug factory,
6+
and a buggy metric makes the eval confidently wrong (worse than no eval).
7+
"""
8+
from typing import Dict, Sequence
9+
10+
11+
def compute_metrics(
12+
qrels: Dict[str, Dict[str, int]],
13+
run: Dict[str, Dict[str, float]],
14+
k_values: Sequence[int] = (5, 10),
15+
) -> Dict[str, float]:
16+
"""Compute recall@k, precision@k for each k, plus MRR (over the returned set).
17+
18+
qrels: {query_id: {doc_id: relevance>=1}} -- the human-labeled answer key.
19+
run: {query_id: {doc_id: score}} -- what search returned.
20+
"""
21+
try:
22+
from ranx import Qrels, Run, evaluate
23+
except ModuleNotFoundError as e:
24+
raise ModuleNotFoundError(
25+
"ranx is required for eval metrics. Install dev deps: "
26+
"pip install -r backend/requirements-dev.txt"
27+
) from e
28+
29+
metric_names = []
30+
for k in k_values:
31+
metric_names.append(f"recall@{k}")
32+
metric_names.append(f"precision@{k}")
33+
metric_names.append("mrr") # results are already capped at top_k, so this is MRR@top_k
34+
35+
scores = evaluate(Qrels(qrels), Run(run), metric_names)
36+
# ranx returns a bare float when a single metric is requested; a dict otherwise.
37+
if isinstance(scores, (int, float)):
38+
scores = {metric_names[0]: scores}
39+
return {m: float(scores[m]) for m in metric_names}

backend/evals/results/.gitignore

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
# Per-run eval artifacts are not committed; the durable comparison point is baseline.json.
2+
*
3+
!.gitignore
