Merged
78 changes: 78 additions & 0 deletions backend/evals/README.md
@@ -0,0 +1,78 @@
# Retrieval-quality eval harness (#312)

Offline, deterministic measurement of OCI's live `search_v2` ranker. Answers one
question with a number: **for a set of known queries, does search return the right
files, ranked high enough to matter?** Metrics: `recall@5`, `recall@10`, `precision@k`,
`MRR`. This is the internal regression instrument; it is not the public benchmark.
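
To make the metric names concrete, here is a minimal sketch of what each one measures for a single query. It is an illustration of the definitions only: the harness itself delegates the math to ranx, and the helper names below (`recall_at_k`, `precision_at_k`, `reciprocal_rank`) are hypothetical, not part of this package.

```python
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant files that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots occupied by relevant files."""
    hits = sum(1 for path in ranked[:k] if path in relevant)
    return hits / k


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant file, or 0.0 if none was returned."""
    for rank, path in enumerate(ranked, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0


# Toy query: two files are relevant, only one of them is returned (at rank 2).
ranked = ["a.py", "b.py", "c.py"]
relevant = {"b.py", "z.py"}
scores = (
    recall_at_k(ranked, relevant, 2),     # 0.5: one of two relevant files found
    precision_at_k(ranked, relevant, 2),  # 0.5: one of two slots is relevant
    reciprocal_rank(ranked, relevant),    # 0.5: first hit is at rank 2
)
```

`MRR` is the mean of `reciprocal_rank` over all scored queries; the per-k metrics are averaged the same way.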

## Why it exists

You cannot improve, defend, or sell what you cannot measure. Before this, search
quality was unquantified. This harness is the speedometer: run it before and after
any change to search (a new reranker, an embedding-model swap, the v2 to v3 cutover)
and see whether the change helped or hurt.

## Tiers (Cohere reranking is pro-only)

Cohere costs money, so reranking is a **pro-tier** feature. The harness records both:

- **Free tier** (`--free-only`): no reranking. The deterministic BM25 + vector core
ranker -- what most users and agents actually get. This is the CI/regression baseline.
- **Pro tier** (`--pro-only`): Cohere reranking. Requires `COHERE_API_KEY`; without it
the pro run is skipped (loudly), never silently duplicated.

Default (`python -m evals`) runs both and prints the delta, so you can see exactly
what reranking is worth.

## Run it

From `backend/` (needs `OPENAI_API_KEY`, `PINECONE_API_KEY`, and a populated index):

```bash
pip install -r requirements-dev.txt # one-time: installs ranx
export OCI_EVAL_REPO_ID=<your-indexed-repo-id> # defaults to OCI's own repo id
python -m evals # both tiers
python -m evals --free-only # free tier only (deterministic, no Cohere)
```

Each run prints a table and writes a timestamped JSON to `results/` (git-ignored) with
a **per-query breakdown** -- the rank of every expected file -- so a regression is
diagnosable, not just detectable.

## First run is a calibration step

`expected_files` are repo-root-relative and must match the `file_path` strings the
index stores. If the first run shows recall near 0 across *all* queries, that is almost
certainly a path-format mismatch, not a bad ranker -- compare `expected_files` against
`returned_files` in the per-query breakdown and adjust the labels. Then set the baseline.
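
The comparison described above can be partly automated. The sketch below assumes each per-query record exposes `expected_files` and `returned_files` as lists of path strings; the exact shape of the results JSON is not shown here, and `suspect_path_mismatches` is a hypothetical helper, not part of the harness. It pairs every missed label with returned paths that share its file name, which is the usual signature of a path-format mismatch.

```python
from pathlib import PurePosixPath
from typing import List, Tuple


def suspect_path_mismatches(
    expected_files: List[str], returned_files: List[str]
) -> List[Tuple[str, str]]:
    """Pair each missed label with returned paths that share its file name."""
    returned = set(returned_files)
    pairs = []
    for label in expected_files:
        if label in returned:
            continue  # exact match: this label's format is fine
        name = PurePosixPath(label).name
        for path in returned_files:
            if PurePosixPath(path).name == name:
                pairs.append((label, path))
    return pairs


# A label written as "backend/..." while the index stored the path without it.
pairs = suspect_path_mismatches(
    ["backend/middleware/auth.py"],
    ["middleware/auth.py", "backend/routes/search_v2.py"],
)
```

A non-empty result for most queries suggests fixing the label format; an empty result with low recall points at the ranker instead.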

## Add a query (the rule that keeps the number honest)

Edit `ground_truth/queries.json`. Two non-negotiables:

1. **Write it agent-shaped.** Phrase it the way an agent asks (`"where is the JWT
validated before a request reaches a route"`), not keyword-shaped (`"auth"`).
2. **Label it blind.** Decide `expected_files` by *reading the repo*, never by looking
at what search returns. Grading the test against the system's own output measures
self-agreement, not correctness (FM-2 in the ADR).
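
A lightweight lint can catch the mechanical half of these rules before a query lands. This is a sketch under stated assumptions: the required keys mirror the entries in `ground_truth/queries.json`, but `lint_query_entry` and the `MIN_WORDS` threshold are hypothetical and not part of the harness. No script can verify that a label was chosen blind; that part stays a review discipline.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("query_id", "query", "expected_files")
MIN_WORDS = 4  # hypothetical threshold for "agent-shaped"; not from the harness


def lint_query_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one ground-truth entry."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]
    if len(str(entry.get("query", "")).split()) < MIN_WORDS:
        problems.append("query looks keyword-shaped (too few words)")
    if not entry.get("expected_files"):
        problems.append("expected_files is empty")
    return problems


good = lint_query_entry({
    "query_id": "q01",
    "query": "where is the JWT validated before a request reaches a route",
    "expected_files": ["backend/middleware/auth.py"],
})
bad = lint_query_entry({"query_id": "qX", "query": "auth", "expected_files": []})
```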

## Calibrate the regression gate

`backend/evals/test_retrieval_quality.py` is a pytest gate that asserts
`recall@10 >= baseline - tolerance`. It lives here (not in `backend/tests/`) because
that suite globally mocks Pinecone + OpenAI, which would force recall to 0. Run it with
`pytest evals/ -v`. It skips until `baseline.json` has `calibrated: true`:

1. Run `python -m evals`, confirm the numbers are real (calibration above).
2. Put the free-tier `recall@10` and `mrr` into `baseline.json`, set `calibrated: true`.
3. From then on, `pytest evals/ -v` fails any change that regresses recall beyond tolerance.
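
The gate's decision rule can be sketched as follows. This is reconstructed from the description above and the shape of `baseline.json`; it is not the contents of `test_retrieval_quality.py`, and `gate_verdict` is a hypothetical name.

```python
from typing import Any, Dict, Optional


def gate_verdict(baseline: Dict[str, Any], measured_recall_at_10: float) -> Optional[bool]:
    """Return None to skip (uncalibrated), else True for pass or False for fail."""
    if not baseline.get("calibrated"):
        return None  # no real baseline yet: skip rather than assert against a fake one
    floor = baseline["free_core"]["recall@10"] - baseline["tolerance"]
    return measured_recall_at_10 >= floor


uncalibrated = {"calibrated": False, "tolerance": 0.05, "free_core": {"recall@10": None}}
calibrated = {"calibrated": True, "tolerance": 0.05, "free_core": {"recall@10": 0.8}}

skip = gate_verdict(uncalibrated, 0.9)     # None: the gate skips
passes = gate_verdict(calibrated, 0.78)    # True: within tolerance of the baseline
regressed = gate_verdict(calibrated, 0.6)  # False: recall fell beyond tolerance
```

Returning a third state for the uncalibrated case keeps "no baseline yet" distinct from "passed", which is the same reason the real gate skips instead of passing.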

## Known limitations (v0.1)

- **Local, not network-free CI.** Runs against the live index, so it needs real creds.
Committing query embeddings + an index snapshot for network-free CI is a follow-up.
- **Pre-flight checks "repo has vectors," not SHA-equality.** `repo_sha` is recorded as
documented intent; strict snapshot-pinning is a follow-up.
- **Empty results are flagged, not perfectly classified.** `search_v2` swallows errors
and returns `[]`, so a true no-hit and a swallowed error look the same here; both are
counted as misses and surfaced in `empty_query_ids` for investigation (FM-3).
8 changes: 8 additions & 0 deletions backend/evals/__init__.py
@@ -0,0 +1,8 @@
"""Retrieval-quality eval harness (OCI #312).

Offline, deterministic measurement of the live search_v2 ranker: recall@k, MRR,
precision@k over a human-labeled query set. Run via `python -m evals` from backend/.

Import isolation: this package and its deps (ranx) are imported only when an eval
runs, never on the backend startup path. See oci known-bug #3.
"""
63 changes: 63 additions & 0 deletions backend/evals/__main__.py
@@ -0,0 +1,63 @@
"""CLI entry for the retrieval-quality eval (OCI #312).

Usage (from backend/):
    python -m evals # record BOTH numbers: free tier + pro tier
    python -m evals --free-only # free tier only (no Cohere, deterministic)
    python -m evals --pro-only # pro tier only (Cohere rerank; needs COHERE_API_KEY)
"""
import argparse
import asyncio
from typing import Any, Dict

from .runner import run_eval


# out is the heterogeneous result dict from run_eval; Any is intentional
def _print_table(label: str, out: Dict[str, Any]) -> None:
    print(f"\n=== {label} (tier={out['tier']}, reranking={out['reranking']}) ===")
    print(
        f"queries scored: {out['n_queries_scored']} "
        f"empty/ambiguous: {out['n_empty_results']} "
        f"skipped repos: {len(out['skipped_repos'])}"
    )
    if not out["metrics"]:
        print(" no metrics: no queries scored (index not populated?). See SKIP-LOUD lines above.")
        return
    for name, val in out["metrics"].items():
        print(f" {name:<14} {val:.4f}")
    if out["empty_query_ids"]:
        print(
            f" NOTE: empty results for queries {out['empty_query_ids']} "
            f"(counted as misses; investigate index/errors before trusting these numbers)."
        )


def main() -> None:
    ap = argparse.ArgumentParser(prog="python -m evals", description="OCI retrieval-quality eval (#312)")
    g = ap.add_mutually_exclusive_group()
    g.add_argument("--free-only", action="store_true", help="free-tier only (no reranking)")
    g.add_argument("--pro-only", action="store_true", help="pro-tier only (Cohere reranking)")
    args = ap.parse_args()

    if args.free_only:
        _print_table("FREE TIER", asyncio.run(run_eval(reranking=False)))
        return
    if args.pro_only:
        _print_table("PRO TIER", asyncio.run(run_eval(reranking=True)))
        return

    # default: record both numbers so we can see exactly what reranking is worth
    free = asyncio.run(run_eval(reranking=False))
    _print_table("FREE TIER (core ranker)", free)
    try:
        pro = asyncio.run(run_eval(reranking=True))
        _print_table("PRO TIER (Cohere rerank)", pro)
        f10, p10 = free["metrics"].get("recall@10"), pro["metrics"].get("recall@10")
        if f10 is not None and p10 is not None:
            print(f"\nReranking delta recall@10: {p10 - f10:+.4f} (what the pro tier buys)")
    except RuntimeError as e:
        print(f"\n[PRO TIER SKIPPED] {e}")


if __name__ == "__main__":
    main()
13 changes: 13 additions & 0 deletions backend/evals/baseline.json
@@ -0,0 +1,13 @@
{
  "calibrated": false,
  "tolerance": 0.05,
  "free_core": {
    "recall@10": null,
    "mrr": null
  },
  "pro_reranked": {
    "recall@10": null,
    "mrr": null
  },
  "note": "Set calibrated=true and fill the numbers after the first real `python -m evals` run against a populated index. Until calibrated, the pytest gate skips rather than asserting against a fake baseline."
}
65 changes: 65 additions & 0 deletions backend/evals/ground_truth/queries.json
@@ -0,0 +1,65 @@
{
  "_about": "Human-labeled retrieval ground truth for OCI #312. Queries are agent-shaped (the way Claude Code phrases a search_code call), NOT keyword-shaped. expected_files are labeled BLIND -- decided by reading the repo source, never by looking at what search returns (see FM-2 in the ADR). Paths are repo-root-relative and must match the file_path strings the index stores; the first run's per-query breakdown is the calibration step. repo_id defaults to OCI's own indexed repo (env OCI_EVAL_REPO_ID overrides per run).",
  "queries": [
    {
      "query_id": "q01",
      "query": "where is the JWT validated before a request reaches a route",
      "expected_files": ["backend/middleware/auth.py"],
      "notes": "auth middleware; JWT local decode + API-key fallback"
    },
    {
      "query_id": "q02",
      "query": "how does hybrid search combine BM25 keyword scores with semantic vector scores",
      "expected_files": ["backend/services/search_v2/hybrid_searcher.py"],
      "notes": "RRF fusion of BM25 + semantic candidates"
    },
    {
      "query_id": "q03",
      "query": "where is Cohere reranking applied to the search candidates",
      "expected_files": ["backend/services/search_v2/hybrid_searcher.py"],
      "notes": "rerank step gated on COHERE_API_KEY (pro tier)"
    },
    {
      "query_id": "q04",
      "query": "how are code embeddings created in batches for indexing",
      "expected_files": ["backend/services/indexer_optimized.py"],
      "notes": "_create_embeddings_batch, EMBEDDING_BATCH_SIZE"
    },
    {
      "query_id": "q05",
      "query": "which API route handles the version 2 search request",
      "expected_files": ["backend/routes/search_v2.py"],
      "notes": "POST /search/v2 route; the path the MCP search_code tool hits"
    },
    {
      "query_id": "q06",
      "query": "how is repository cloning made durable and recovered when an indexing job gets stuck",
      "expected_files": ["backend/services/repo_manager.py", "backend/services/supabase_service.py"],
      "notes": "ensure_clone chokepoint + reset_stuck_indexing_jobs (PR #316)"
    },
    {
      "query_id": "q07",
      "query": "where are import dependency graphs extracted from the parsed AST",
      "expected_files": ["backend/services/dependency_analyzer.py"],
      "notes": "tree-sitter import graph extraction, singleton service"
    },
    {
      "query_id": "q08",
      "query": "how does the MCP server forward a search_code tool call to the backend API",
      "expected_files": ["mcp-server/handlers.py"],
      "notes": "_handle_search posts to /search/v2"
    },
    {
      "query_id": "q09",
      "query": "where is the codebase DNA and architectural pattern detection implemented",
      "expected_files": ["backend/services/dna_extractor.py"],
      "notes": "DNAExtractor.extract, team-rules detection"
    },
    {
      "query_id": "q10",
      "query": "how does file path filtering avoid matching adjacent sibling directories",
      "expected_files": ["backend/services/indexer_optimized.py"],
      "notes": "_discover_code_files include_paths; Path.parts not str.startswith (known-bug #5)"
    }
  ]
}
39 changes: 39 additions & 0 deletions backend/evals/metrics.py
@@ -0,0 +1,39 @@
"""Retrieval metrics, computed by ranx.

ranx is imported lazily inside the function: it is a heavy optional dep and must
never be on the backend startup import path (oci known-bug #3). We buy the metric
math instead of hand-rolling it -- recall@k / MRR are a classic subtle-bug factory,
and a buggy metric makes the eval confidently wrong (worse than no eval).
"""
from typing import Dict, Sequence


def compute_metrics(
    qrels: Dict[str, Dict[str, int]],
    run: Dict[str, Dict[str, float]],
    k_values: Sequence[int] = (5, 10),
) -> Dict[str, float]:
    """Compute recall@k, precision@k for each k, plus MRR (over the returned set).

    qrels: {query_id: {doc_id: relevance>=1}} -- the human-labeled answer key.
    run: {query_id: {doc_id: score}} -- what search returned.
    """
    try:
        from ranx import Qrels, Run, evaluate
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError(
            "ranx is required for eval metrics. Install dev deps: "
            "pip install -r backend/requirements-dev.txt"
        ) from e

    metric_names = []
    for k in k_values:
        metric_names.append(f"recall@{k}")
        metric_names.append(f"precision@{k}")
    metric_names.append("mrr")  # results are already capped at top_k, so this is MRR@top_k

    scores = evaluate(Qrels(qrels), Run(run), metric_names)
    # ranx returns a bare float when a single metric is requested; a dict otherwise.
    if isinstance(scores, (int, float)):
        scores = {metric_names[0]: scores}
    return {m: float(scores[m]) for m in metric_names}
3 changes: 3 additions & 0 deletions backend/evals/results/.gitignore
@@ -0,0 +1,3 @@
# Per-run eval artifacts are not committed; the durable comparison point is baseline.json.
*
!.gitignore