| 1 | +# Retrieval-quality eval harness (#312) |
| 2 | + |
| 3 | +Offline, deterministic measurement of OCI's live `search_v2` ranker. Answers one |
| 4 | +question with a number: **for a set of known queries, does search return the right |
| 5 | +files, ranked high enough to matter?** Metrics: `recall@5`, `recall@10`, `precision@k`, |
| 6 | +`MRR`. This is the internal regression instrument; it is not the public benchmark. |
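As a rough illustration of what these metrics compute, here is a plain-Python sketch of the standard definitions. It is not the harness's own code (the dev requirements install `ranx`, which presumably does the real scoring); the function names and file paths are hypothetical.

```python
def recall_at_k(expected, returned, k):
    """Fraction of the expected files that appear in the top-k results."""
    if not expected:
        return 0.0
    top_k = set(returned[:k])
    return sum(1 for path in expected if path in top_k) / len(expected)


def precision_at_k(expected, returned, k):
    """Fraction of the top-k slots occupied by an expected file."""
    if k <= 0:
        return 0.0
    expected_set = set(expected)
    return sum(1 for path in returned[:k] if path in expected_set) / k


def reciprocal_rank(expected, returned):
    """1 / rank of the first expected file in the results, or 0.0 if none appears."""
    expected_set = set(expected)
    for rank, path in enumerate(returned, start=1):
        if path in expected_set:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR: the average reciprocal rank over (expected, returned) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(e, r) for e, r in runs) / len(runs)


# Hypothetical query: two expected files, found at ranks 2 and 4.
expected = ["backend/auth.py", "backend/routes.py"]
returned = ["backend/utils.py", "backend/auth.py", "backend/db.py", "backend/routes.py"]
print(recall_at_k(expected, returned, 2))     # 0.5
print(recall_at_k(expected, returned, 5))     # 1.0
print(precision_at_k(expected, returned, 2))  # 0.5
print(reciprocal_rank(expected, returned))    # 0.5
```

Recall rewards finding every expected file, while MRR only looks at how high the first one lands, which is why the two can move independently.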
| 7 | + |
| 8 | +## Why it exists |
| 9 | + |
| 10 | +You cannot improve, defend, or sell what you cannot measure. Before this, search |
| 11 | +quality was unquantified. This harness is the speedometer: run it before and after |
| 12 | +any change to search (a new reranker, an embedding-model swap, the v2 to v3 cutover) |
| 13 | +and see whether the change helped or hurt. |
| 14 | + |
| 15 | +## Tiers (Cohere reranking is pro-only) |
| 16 | + |
| 17 | +Cohere costs money, so reranking is a **pro-tier** feature. The harness records both: |
| 18 | + |
| 19 | +- **Free tier** (`--free-only`): no reranking. The deterministic BM25 + vector core |
| 20 | + ranker -- what most users and agents actually get. This is the CI/regression baseline. |
| 21 | +- **Pro tier** (`--pro-only`): Cohere reranking. Requires `COHERE_API_KEY`; without it |
| 22 | + the pro run is skipped (loudly); free-tier results are never silently reported as pro. |
| 23 | + |
| 24 | +Default (`python -m evals`) runs both and prints the delta, so you can see exactly |
| 25 | +what reranking is worth. |
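The tier rules above could be sketched roughly as follows. This is a minimal sketch of the described behaviour, not the harness's actual code: `plan_tiers`, `tier_delta`, and the metric dictionaries are hypothetical names.

```python
import sys


def plan_tiers(free_only, pro_only, env):
    """Decide which tiers to run; skip pro loudly when the Cohere key is missing."""
    tiers = []
    if not pro_only:
        tiers.append("free")
    if not free_only:
        if env.get("COHERE_API_KEY"):
            tiers.append("pro")
        else:
            # Loud skip: never fall back to reporting free-tier numbers as pro.
            print("WARNING: COHERE_API_KEY not set; skipping pro tier", file=sys.stderr)
    return tiers


def tier_delta(free_metrics, pro_metrics):
    """Per-metric difference (pro minus free) for metrics present in both runs."""
    return {
        name: pro_metrics[name] - free_metrics[name]
        for name in free_metrics
        if name in pro_metrics
    }


# Made-up numbers, purely to show the shape of the delta.
print(plan_tiers(free_only=False, pro_only=False, env={"COHERE_API_KEY": "set"}))
print(tier_delta({"recall@10": 0.5}, {"recall@10": 0.75}))
```

In a real run, `env` would be `os.environ`; passing it in as an argument simply keeps the sketch easy to test.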
| 26 | + |
| 27 | +## Run it |
| 28 | + |
| 29 | +From `backend/` (needs `OPENAI_API_KEY`, `PINECONE_API_KEY`, and a populated index): |
| 30 | + |
| 31 | +```bash |
| 32 | +pip install -r requirements-dev.txt # one-time: installs ranx |
| 33 | +export OCI_EVAL_REPO_ID=<your-indexed-repo-id> # defaults to OCI's own repo id |
| 34 | +python -m evals # both tiers |
| 35 | +python -m evals --free-only # free tier only (deterministic, no Cohere) |
| 36 | +``` |
| 37 | + |
| 38 | +Each run prints a table and writes a timestamped JSON to `results/` (git-ignored) with |
| 39 | +a **per-query breakdown** -- the rank of every expected file -- so a regression is |
| 40 | +diagnosable, not just detectable. |
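A per-query breakdown of this kind might be computed along the following lines. This is only a sketch; `ranks_of_expected` and the sample paths are hypothetical, and the real JSON layout may differ.

```python
def ranks_of_expected(expected_files, returned_files):
    """Map each expected file to its 1-based rank in the results, or None if absent."""
    position = {}
    for rank, path in enumerate(returned_files, start=1):
        position.setdefault(path, rank)  # keep the best rank if a path repeats
    return {path: position.get(path) for path in expected_files}


# Hypothetical query result: one expected file found at rank 2, one missing.
print(ranks_of_expected(
    ["backend/auth.py", "backend/routes.py"],
    ["backend/utils.py", "backend/auth.py"],
))
```

A `None` rank tells you the file was never retrieved, while a rank just past the cutoff (say 11 for `recall@10`) tells you it was found but ranked too low, which are different problems to fix.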
| 41 | + |
| 42 | +## First run is a calibration step |
| 43 | + |
| 44 | +`expected_files` are repo-root-relative and must match the `file_path` strings the |
| 45 | +index stores. If the first run shows recall near 0 across *all* queries, that is almost |
| 46 | +certainly a path-format mismatch, not a bad ranker -- compare `expected_files` against |
| 47 | +`returned_files` in the per-query breakdown and adjust the labels. Then set the baseline. |
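One way to spot a path-format mismatch quickly is to look for expected and returned paths that differ only by a prefix or separator style. The helper below is a hypothetical diagnostic sketch, not part of the harness, and the example paths are made up.

```python
import posixpath


def normalize(path):
    """Use forward slashes, collapse './' and '//' segments, drop any leading '/'."""
    path = posixpath.normpath(path.replace("\\", "/"))
    return path.lstrip("/")


def near_misses(expected_files, returned_files):
    """Pairs that are not string-equal but match after normalization or as a path suffix."""
    pairs = []
    for exp in expected_files:
        e = normalize(exp)
        for ret in returned_files:
            r = normalize(ret)
            same_file = r == e or r.endswith("/" + e) or e.endswith("/" + r)
            if exp != ret and same_file:
                pairs.append((exp, ret))
    return pairs


# A label written repo-root-relative vs. an index that stores a './' prefix.
print(near_misses(
    ["backend/services/auth.py"],
    ["./backend/services/auth.py", "backend/main.py"],
))
```

If nearly every query produces a near miss, the labels and the index disagree on path format, and the ranker itself is probably not at fault.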
| 48 | + |
| 49 | +## Add a query (the rule that keeps the number honest) |
| 50 | + |
| 51 | +Edit `ground_truth/queries.json`. Two non-negotiables: |
| 52 | + |
| 53 | +1. **Write it agent-shaped.** Phrase it the way an agent asks (`"where is the JWT |
| 54 | + validated before a request reaches a route"`), not keyword-shaped (`"auth"`). |
| 55 | +2. **Label it blind.** Decide `expected_files` by *reading the repo*, never by looking |
| 56 | + at what search returns. Grading the test against the system's own output measures |
| 57 | + self-agreement, not correctness (FM-2 in the ADR). |
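As an illustration only, a query entry and a small lint for it might look like the sketch below. The schema is an assumption: only `expected_files` is named in this README, so the `id` and `query` keys, the sample path, and the word-count heuristic are all hypothetical.

```python
import json

# Hypothetical entry; the file path is invented for the example.
SAMPLE = """
[
  {
    "id": "q-jwt-validation",
    "query": "where is the JWT validated before a request reaches a route",
    "expected_files": ["backend/middleware/auth.py"]
  }
]
"""


def lint_query(entry, min_words=4):
    """Return a list of problems; an empty list means the entry looks plausible."""
    problems = []
    # Crude heuristic: very short queries tend to be keyword-shaped.
    if len(entry.get("query", "").split()) < min_words:
        problems.append("query looks keyword-shaped")
    files = entry.get("expected_files") or []
    if not files:
        problems.append("expected_files is empty")
    for path in files:
        if path.startswith(("/", "./")):
            problems.append(f"not repo-root-relative: {path}")
    return problems


for entry in json.loads(SAMPLE):
    print(entry["id"], lint_query(entry))
```

A lint like this can catch formatting slips, but it cannot check the blind-labeling rule; that one depends on how the labels were produced.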
| 58 | + |
| 59 | +## Calibrate the regression gate |
| 60 | + |
| 61 | +`backend/evals/test_retrieval_quality.py` is a pytest gate that asserts |
| 62 | +`recall@10 >= baseline - tolerance`. It lives here (not in `backend/tests/`) because |
| 63 | +that suite globally mocks Pinecone + OpenAI, which would force recall to 0. Run it with |
| 64 | +`pytest evals/ -v`. It skips until `baseline.json` has `calibrated: true`: |
| 65 | + |
| 66 | +1. Run `python -m evals`, confirm the numbers are real (calibration above). |
| 67 | +2. Put the free-tier `recall@10` and `mrr` into `baseline.json`, set `calibrated: true`. |
| 68 | +3. From then on, `pytest evals/ -v` fails any change that regresses recall beyond tolerance. |
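The gate's decision logic can be sketched as a pure function. This is a minimal sketch under assumptions: the `recall@10` and `tolerance` key names inside `baseline.json` are guesses, and the real test uses pytest's skip and assert machinery rather than returning strings.

```python
def gate(baseline, measured_recall_at_10):
    """Return 'skip', 'pass', or 'fail' for a measured free-tier recall@10."""
    if not baseline.get("calibrated"):
        return "skip"  # no trustworthy baseline yet, so there is nothing to compare against
    floor = baseline["recall@10"] - baseline.get("tolerance", 0.0)
    return "pass" if measured_recall_at_10 >= floor else "fail"


# Made-up baseline values, purely for illustration.
baseline = {"calibrated": True, "recall@10": 0.8, "mrr": 0.6, "tolerance": 0.05}
print(gate(baseline, 0.78))  # pass
print(gate(baseline, 0.70))  # fail
print(gate({"calibrated": False}, 0.70))  # skip
```

The tolerance exists so that small, noise-level movement in recall does not fail the build; only drops larger than the tolerance count as regressions.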
| 69 | + |
| 70 | +## Known limitations (v0.1) |
| 71 | + |
| 72 | +- **Local, not network-free CI.** Runs against the live index, so it needs real creds. |
| 73 | + Committing query embeddings + an index snapshot for network-free CI is a follow-up. |
| 74 | +- **Pre-flight checks "repo has vectors," not SHA-equality.** `repo_sha` is recorded as |
| 75 | + documented intent only and is not enforced; strict snapshot-pinning is a follow-up. |
| 76 | +- **Empty results are flagged, not perfectly classified.** `search_v2` swallows errors |
| 77 | + and returns `[]`, so a true no-hit and a swallowed error look the same here; both are |
| 78 | + counted as misses and surfaced in `empty_query_ids` for investigation (FM-3). |
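The empty-result handling described in the last limitation might look roughly like this. It is a sketch: the per-query record shape (`id`, `returned_files`) is an assumption, and only the `empty_query_ids` name comes from this README.

```python
def collect_empty_query_ids(per_query):
    """IDs of queries whose search returned nothing (true no-hit or swallowed error)."""
    return [q["id"] for q in per_query if not q["returned_files"]]


# Hypothetical per-query records.
per_query = [
    {"id": "q1", "returned_files": ["backend/auth.py"]},
    {"id": "q2", "returned_files": []},
]
print(collect_empty_query_ids(per_query))  # ['q2']
```

Because both causes look identical at this layer, the list is a prompt for manual investigation, not a classification.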