test: retrieval-quality regression gate, seed ground truth, docs (#312)

DevanshuNEU · DevanshuNEU · commit 1a24f3174b50 · 2026-06-10T23:47:25.000-04:00
pytest gate asserts recall@10 >= baseline - tolerance. Lives under evals/
(not tests/) because pytest.ini testpaths=tests and tests/conftest.py mocks
Pinecone+OpenAI globally, which would force recall to 0; the gate needs the
real index. Skips until baseline.json is calibrated, so it never asserts
against a fake number.

Ground truth: 10 agent-shaped queries labeled blind against OCI's own source.
Empty search results are flagged and counted as misses, not hidden, since
search_v2 swallows errors to [] (FM-3).
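
For readers new to the metrics, here is a minimal sketch of how recall@k and MRR might be computed with empty results counted as misses. It is plain Python for illustration only: the harness installs ranx for its metrics, and the helper names (`recall_at_k`, `reciprocal_rank`, `score_queries`) and the input shapes are assumptions, not code from this commit.

```python
from typing import Dict, List, Sequence


def recall_at_k(expected: Sequence[str], returned: Sequence[str], k: int) -> float:
    """Fraction of expected files that appear among the top-k returned files."""
    if not expected:
        return 0.0
    top_k = set(returned[:k])
    return sum(1 for path in expected if path in top_k) / len(expected)


def reciprocal_rank(expected: Sequence[str], returned: Sequence[str]) -> float:
    """1 / rank of the first expected file in the results, or 0.0 if none appears."""
    expected_set = set(expected)
    for rank, path in enumerate(returned, start=1):
        if path in expected_set:
            return 1.0 / rank
    return 0.0


def score_queries(
    ground_truth: Dict[str, List[str]],
    results: Dict[str, List[str]],
    k: int = 10,
) -> dict:
    """Average per-query recall@k and reciprocal rank over ALL labeled queries.

    A query with no results still contributes a 0.0 to both averages and is
    listed in `empty_query_ids`, so a swallowed error cannot inflate the score.
    """
    recalls, ranks, empty = [], [], []
    for query_id, expected in ground_truth.items():
        returned = results.get(query_id, [])
        if not returned:
            empty.append(query_id)  # flagged, and still scored as a miss below
        recalls.append(recall_at_k(expected, returned, k))
        ranks.append(reciprocal_rank(expected, returned))
    n = len(ground_truth)
    return {
        f"recall@{k}": sum(recalls) / n if n else 0.0,
        "mrr": sum(ranks) / n if n else 0.0,
        "empty_query_ids": empty,
    }
```

The key design point is the denominator: it is the number of labeled queries, not the number of queries that returned something.
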
diff --git a/backend/evals/README.md b/backend/evals/README.md
@@ -0,0 +1,78 @@
+# Retrieval-quality eval harness (#312)
+
+Offline, deterministic measurement of OCI's live `search_v2` ranker. Answers one
+question with a number: **for a set of known queries, does search return the right
+files, ranked high enough to matter?** Metrics: `recall@5`, `recall@10`, `precision@k`,
+`MRR`. This is the internal regression instrument; it is not the public benchmark.
+
+## Why it exists
+
+You cannot improve, defend, or sell what you cannot measure. Before this, search
+quality was unquantified. This harness is the speedometer: run it before and after
+any change to search (a new reranker, an embedding-model swap, the v2 to v3 cutover)
+and see whether the change helped or hurt.
+
+## Tiers (Cohere reranking is pro-only)
+
+Cohere costs money, so reranking is a **pro-tier** feature. The harness records both:
+
+- **Free tier** (`--free-only`): no reranking. The deterministic BM25 + vector core
+  ranker -- what most users and agents actually get. This is the CI/regression baseline.
+- **Pro tier** (`--pro-only`): Cohere reranking. Requires `COHERE_API_KEY`; without it
+  the pro run is skipped (loudly), never silently duplicated.
+
+Default (`python -m evals`) runs both and prints the delta, so you can see exactly
+what reranking is worth.
+
+## Run it
+
+From `backend/` (needs `OPENAI_API_KEY`, `PINECONE_API_KEY`, and a populated index):
+
+```bash
+pip install -r requirements-dev.txt        # one-time: installs ranx
+export OCI_EVAL_REPO_ID=<your-indexed-repo-id>   # defaults to OCI's own repo id
+python -m evals                            # both tiers
+python -m evals --free-only                # free tier only (deterministic, no Cohere)
+```
+
+Each run prints a table and writes a timestamped JSON to `results/` (git-ignored) with
+a **per-query breakdown** -- the rank of every expected file -- so a regression is
+diagnosable, not just detectable.
+
+## First run is a calibration step
+
+`expected_files` are repo-root-relative and must match the `file_path` strings the
+index stores. If the first run shows recall near 0 across *all* queries, that is almost
+certainly a path-format mismatch, not a bad ranker -- compare `expected_files` against
+`returned_files` in the per-query breakdown and adjust the labels. Then set the baseline.
+
+## Add a query (the rule that keeps the number honest)
+
+Edit `ground_truth/queries.json`. Two non-negotiables:
+
+1. **Write it agent-shaped.** Phrase it the way an agent asks (`"where is the JWT
+   validated before a request reaches a route"`), not keyword-shaped (`"auth"`).
+2. **Label it blind.** Decide `expected_files` by *reading the repo*, never by looking
+   at what search returns. Grading the test against the system's own output measures
+   self-agreement, not correctness (FM-2 in the ADR).
+
+## Calibrate the regression gate
+
+`backend/evals/test_retrieval_quality.py` is a pytest gate that asserts
+`recall@10 >= baseline - tolerance`. It lives here (not in `backend/tests/`) because
+that suite globally mocks Pinecone + OpenAI, which would force recall to 0. Run it with
+`pytest evals/ -v`. It skips until `baseline.json` has `calibrated: true`:
+
+1. Run `python -m evals`, confirm the numbers are real (calibration above).
+2. Put the free-tier `recall@10` and `mrr` into `baseline.json`, set `calibrated: true`.
+3. From then on, `pytest evals/ -v` fails any change that regresses recall beyond tolerance.
+
+## Known limitations (v0.1)
+
+- **Local, not network-free CI.** Runs against the live index, so it needs real creds.
+  Committing query embeddings + an index snapshot for network-free CI is a follow-up.
+- **Pre-flight checks "repo has vectors," not SHA-equality.** `repo_sha` is recorded as
+  documented intent; strict snapshot-pinning is a follow-up.
+- **Empty results are flagged, not perfectly classified.** `search_v2` swallows errors
+  and returns `[]`, so a true no-hit and a swallowed error look the same here; both are
+  counted as misses and surfaced in `empty_query_ids` for investigation (FM-3).
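
The path-format calibration step described in the README could be partly automated. The sketch below is a hypothetical helper, not part of this commit: it assumes the per-query breakdown is a list of dicts with `query_id`, `expected_files`, and `returned_files` keys, and it flags expected paths that miss only because of a prefix or leading-`./` difference.

```python
from pathlib import PurePosixPath
from typing import List, Tuple


def _normalize(path: str) -> str:
    """Use forward slashes, drop a leading './' and any leading '/'."""
    return PurePosixPath(path.replace("\\", "/")).as_posix().lstrip("/")


def _is_suffix(longer: str, shorter: str) -> bool:
    """True if `shorter` matches the trailing path components of `longer`."""
    lp, sp = PurePosixPath(longer).parts, PurePosixPath(shorter).parts
    return 0 < len(sp) < len(lp) and lp[-len(sp):] == sp


def find_path_format_mismatches(breakdown: List[dict]) -> List[Tuple[str, str, str]]:
    """Return (query_id, expected, returned) triples that differ only in path format."""
    suspects = []
    for row in breakdown:
        returned = row.get("returned_files", [])
        exact = set(returned)
        for expected in row.get("expected_files", []):
            if expected in exact:
                continue  # a genuine hit, nothing to diagnose
            exp_norm = _normalize(expected)
            for got in returned:
                got_norm = _normalize(got)
                if (
                    got_norm == exp_norm
                    or _is_suffix(got_norm, exp_norm)
                    or _is_suffix(exp_norm, got_norm)
                ):
                    suspects.append((row.get("query_id"), expected, got))
                    break
    return suspects
```

Comparing by path components rather than by string suffix avoids treating `oauth.py` as a match for `auth.py`. If most queries show up in the output, the labels probably need a prefix adjustment rather than the ranker needing a fix.
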
diff --git a/backend/evals/baseline.json b/backend/evals/baseline.json
@@ -0,0 +1,13 @@
+{
+  "calibrated": false,
+  "tolerance": 0.05,
+  "free_core": {
+    "recall@10": null,
+    "mrr": null
+  },
+  "pro_reranked": {
+    "recall@10": null,
+    "mrr": null
+  },
+  "note": "Set calibrated=true and fill the numbers after the first real `python -m evals` run against a populated index. Until calibrated, the pytest gate skips rather than asserting against a fake baseline."
+}
diff --git a/backend/evals/ground_truth/queries.json b/backend/evals/ground_truth/queries.json
@@ -0,0 +1,65 @@
+{
+  "_about": "Human-labeled retrieval ground truth for OCI #312. Queries are agent-shaped (the way Claude Code phrases a search_code call), NOT keyword-shaped. expected_files are labeled BLIND -- decided by reading the repo source, never by looking at what search returns (see FM-2 in the ADR). Paths are repo-root-relative and must match the file_path strings the index stores; the first run's per-query breakdown is the calibration step. repo_id defaults to OCI's own indexed repo (env OCI_EVAL_REPO_ID overrides per run).",
+  "queries": [
+    {
+      "query_id": "q01",
+      "query": "where is the JWT validated before a request reaches a route",
+      "expected_files": ["backend/middleware/auth.py"],
+      "notes": "auth middleware; JWT local decode + API-key fallback"
+    },
+    {
+      "query_id": "q02",
+      "query": "how does hybrid search combine BM25 keyword scores with semantic vector scores",
+      "expected_files": ["backend/services/search_v2/hybrid_searcher.py"],
+      "notes": "RRF fusion of BM25 + semantic candidates"
+    },
+    {
+      "query_id": "q03",
+      "query": "where is Cohere reranking applied to the search candidates",
+      "expected_files": ["backend/services/search_v2/hybrid_searcher.py"],
+      "notes": "rerank step gated on COHERE_API_KEY (pro tier)"
+    },
+    {
+      "query_id": "q04",
+      "query": "how are code embeddings created in batches for indexing",
+      "expected_files": ["backend/services/indexer_optimized.py"],
+      "notes": "_create_embeddings_batch, EMBEDDING_BATCH_SIZE"
+    },
+    {
+      "query_id": "q05",
+      "query": "which API route handles the version 2 search request",
+      "expected_files": ["backend/routes/search_v2.py"],
+      "notes": "POST /search/v2 route; the path the MCP search_code tool hits"
+    },
+    {
+      "query_id": "q06",
+      "query": "how is repository cloning made durable and recovered when an indexing job gets stuck",
+      "expected_files": ["backend/services/repo_manager.py", "backend/services/supabase_service.py"],
+      "notes": "ensure_clone chokepoint + reset_stuck_indexing_jobs (PR #316)"
+    },
+    {
+      "query_id": "q07",
+      "query": "where are import dependency graphs extracted from the parsed AST",
+      "expected_files": ["backend/services/dependency_analyzer.py"],
+      "notes": "tree-sitter import graph extraction, singleton service"
+    },
+    {
+      "query_id": "q08",
+      "query": "how does the MCP server forward a search_code tool call to the backend API",
+      "expected_files": ["mcp-server/handlers.py"],
+      "notes": "_handle_search posts to /search/v2"
+    },
+    {
+      "query_id": "q09",
+      "query": "where is the codebase DNA and architectural pattern detection implemented",
+      "expected_files": ["backend/services/dna_extractor.py"],
+      "notes": "DNAExtractor.extract, team-rules detection"
+    },
+    {
+      "query_id": "q10",
+      "query": "how does file path filtering avoid matching adjacent sibling directories",
+      "expected_files": ["backend/services/indexer_optimized.py"],
+      "notes": "_discover_code_files include_paths; Path.parts not str.startswith (known-bug #5)"
+    }
+  ]
+}
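
Because labeling mistakes in `queries.json` silently lower the score, a small structural check before each run may help. This is a minimal sketch under stated assumptions: `validate_ground_truth` is a hypothetical helper that is not part of this commit, and the rules it enforces (unique `query_id`, non-empty `query` and `expected_files`, clean repo-root-relative paths) are inferred from the `_about` note rather than from the harness code.

```python
from pathlib import PurePosixPath
from typing import List


def validate_ground_truth(data: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means the file looks sane."""
    problems = []
    seen = set()
    for i, entry in enumerate(data.get("queries", [])):
        query_id = entry.get("query_id")
        label = query_id or f"queries[{i}]"
        if not query_id:
            problems.append(f"{label}: missing query_id")
        elif query_id in seen:
            problems.append(f"{label}: duplicate query_id")
        seen.add(query_id)

        if not str(entry.get("query", "")).strip():
            problems.append(f"{label}: query text is empty")

        files = entry.get("expected_files") or []
        if not files:
            problems.append(f"{label}: expected_files is empty")
        for path in files:
            # Repo-root-relative: no leading '/' or './', no backslashes, no '..'.
            if (
                path.startswith(("/", "./"))
                or "\\" in path
                or ".." in PurePosixPath(path).parts
            ):
                problems.append(f"{label}: {path!r} is not a clean repo-root-relative path")
    return problems
```

A caller would typically load the file with `json.load` and refuse to run the eval if the returned list is non-empty. This only checks structure; it cannot tell whether a label was actually decided blind.
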
diff --git a/backend/evals/results/.gitignore b/backend/evals/results/.gitignore
@@ -0,0 +1,3 @@
+# Per-run eval artifacts are not committed; the durable comparison point is baseline.json.
+*
+!.gitignore
diff --git a/backend/evals/test_retrieval_quality.py b/backend/evals/test_retrieval_quality.py
@@ -0,0 +1,36 @@
+"""Retrieval-quality regression gate (OCI #312).
+
+This lives under backend/evals/ (NOT backend/tests/) on purpose: tests/conftest.py
+mocks Pinecone + OpenAI globally via autouse fixtures, which would force recall to 0.
+This gate needs the REAL index, so it runs against live services and is excluded from
+the mocked `pytest tests/` suite. Run it explicitly:
+
+    cd backend && pytest evals/ -v
+
+It self-skips unless real creds are present AND a baseline has been calibrated.
+"""
+import asyncio
+import os
+
+import pytest
+
+from evals.runner import load_baseline, run_eval
+
+_REQUIRED_ENV = ("OPENAI_API_KEY", "PINECONE_API_KEY")
+
+pytestmark = pytest.mark.skipif(
+    not all(os.getenv(k) for k in _REQUIRED_ENV),
+    reason="retrieval eval needs real OPENAI_API_KEY + PINECONE_API_KEY + a populated index",
+)
+
+
+def test_recall_at_10_meets_baseline():
+    baseline = load_baseline()
+    if not baseline.get("calibrated"):
+        pytest.skip("baseline not calibrated yet; run `python -m evals` and record numbers first")
+
+    out = asyncio.run(run_eval(reranking=False))
+    tol = baseline.get("tolerance", 0.05)
+    floor = baseline["free_core"]["recall@10"] - tol
+    actual = out["metrics"].get("recall@10", 0.0)
+    assert actual >= floor, f"recall@10 {actual:.4f} below baseline floor {floor:.4f}"
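
The floor comparison in the gate can be exercised without live services by pulling it into a pure function. The sketch below is an illustration, not code from this commit: `check_recall_floor` is a hypothetical name, and treating a `null` baseline value as "not calibrated" (rather than letting `None - tol` raise a `TypeError`) is an assumption about the desired behavior.

```python
from typing import Optional, Tuple


def check_recall_floor(
    baseline: dict,
    actual: float,
    tier: str = "free_core",
    metric: str = "recall@10",
) -> Tuple[Optional[bool], str]:
    """Compare a measured metric against `baseline - tolerance`.

    Returns (None, reason) when no real baseline exists, so a caller can skip
    instead of asserting against a fake number; otherwise (passed, message).
    """
    if not baseline.get("calibrated"):
        return None, "baseline not calibrated"

    value = baseline.get(tier, {}).get(metric)
    if value is None:
        # calibrated=true but the number was never filled in
        return None, f"{tier}.{metric} is null"

    floor = value - baseline.get("tolerance", 0.05)
    return actual >= floor, f"{metric} {actual:.4f} vs baseline floor {floor:.4f}"
```

In a pytest gate, a `None` result would map to `pytest.skip(message)` and a `False` result to a failed assertion carrying the message.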
