
feat: retrieval-quality eval harness (recall@10, MRR) for the live search_v2 ranker #318

Merged
DevanshuNEU merged 3 commits into OpenCodeIntel:main from DevanshuNEU:feat/retrieval-eval-harness-v0.1 on Jun 11, 2026

Conversation

@DevanshuNEU

@DevanshuNEU DevanshuNEU commented Jun 11, 2026

Collaborator

Summary

Adds an offline, deterministic harness that measures the live search_v2 ranker: recall@5/10, MRR, precision@k over a human-labeled query set. This is OCI's first retrieval-quality instrument, so search changes (rerankers, embedding swaps, the v2 to v3 cutover) can finally be regression-gated instead of guessed at.

Closes / implements

What changed

Backend (backend/...):

  • backend/evals/runner.py - the engine: fail-closed pre-flight per repo, in-process indexer.search_v2 per query (cache off), function-level hits collapsed to file-level keeping best rank, ranx qrels/run build, per-query breakdown + timestamped JSON results.
  • backend/evals/metrics.py - thin ranx wrapper (recall@5/10, precision@k, MRR); ranx lazy-imported with an actionable error (known-bug #3 isolation).
  • backend/evals/__main__.py - python -m evals; default records BOTH tiers (free = no-rerank baseline, pro = Cohere) and prints the delta.
  • backend/requirements-dev.txt - new dev-only file pinning ranx; prod requirements.txt and config/startup_checks.py untouched.
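
The "function-level hits collapsed to file-level keeping best rank" step in runner.py can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `file_path` key, the function name, and the example paths are assumptions, and hits are assumed to arrive ordered best-first.

```python
from typing import Dict, Iterable, List


def dedupe_files_by_rank(hits: Iterable[Dict[str, str]]) -> List[str]:
    """Collapse function-level hits to file level, keeping each file's best rank.

    Assumes `hits` is ordered best-first and that each hit carries a
    "file_path" key (hypothetical field name).
    """
    seen = set()
    files: List[str] = []
    for hit in hits:
        path = hit.get("file_path")
        if not path or path in seen:
            # A later hit in the same file cannot improve that file's rank.
            continue
        seen.add(path)
        files.append(path)
    return files


# Hypothetical hits: two functions from the same file, one from another file.
hits = [
    {"file_path": "backend/services/indexer.py", "name": "search_v2"},
    {"file_path": "backend/routes/search.py", "name": "search_route"},
    {"file_path": "backend/services/indexer.py", "name": "_rerank"},
]
ranked_files = dedupe_files_by_rank(hits)
```

Because the labels in queries.json are file-level, collapsing before scoring keeps one file from occupying several of the top-k slots.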

Tests / docs:

  • backend/evals/test_retrieval_quality.py - pytest regression gate (recall@10 >= baseline - tolerance); lives under evals/ so pytest.ini testpaths=tests + the global Pinecone/OpenAI mocks never touch it.
  • backend/evals/ground_truth/queries.json - 10 agent-shaped queries labeled blind against OCI's own source.
  • backend/evals/baseline.json - ships calibrated: false; gate skips until a real number is recorded.
  • backend/evals/README.md, backend/evals/results/.gitignore.
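
The gate rule described above (recall@10 >= baseline - tolerance, skipped while calibrated is false) could look something like this. It is a sketch under assumptions: the `recall@10` key name inside baseline.json and the `gate_decision` helper are hypothetical, and the real gate is a pytest test rather than a function returning strings.

```python
from typing import Any, Dict


def gate_decision(baseline: Dict[str, Any], measured_recall_at_10: float) -> str:
    """Return "skip", "pass", or "fail" for one eval run.

    The "recall@10" key is an assumed name for the baseline metric field.
    """
    if not baseline.get("calibrated") or baseline.get("recall@10") is None:
        # Never assert against a placeholder number.
        return "skip"
    floor = baseline["recall@10"] - baseline.get("tolerance", 0.0)
    return "pass" if measured_recall_at_10 >= floor else "fail"


uncalibrated = {"calibrated": False, "tolerance": 0.05, "recall@10": None}
calibrated = {"calibrated": True, "tolerance": 0.05, "recall@10": 0.80}
```

With the calibrated example, a run measuring 0.78 passes (the floor is about 0.75) and a run measuring 0.70 fails.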

ADR adherence

  • Data model matches ADR "Data model" (qrels stored as one queries.json, converted to ranx in code - noted in ADR addendum)
  • Consistency guarantee matches ADR (read-against-pinned-snapshot via fail-closed pre-flight; skips loud on empty index)
  • Blast radius as predicted: purely additive (10 new files, 0 edits to existing code), zero user-facing surface
  • Failure modes guarded: FM-1 pre-flight, FM-3 empties flagged not hidden, FM-2 blind-label rule documented
  • Latency budget: n/a - explicitly off the agent hot path (per ADR + latency-budget.md); full-run wall-clock target only
  • All acceptance criteria: harness criteria met; "one real recall@10 + MRR number recorded" is pending the first calibration run (needs real creds + populated index - see How to test)

Pipeline

  • /oci-pipeline - NOT run this session (went straight to PR by request); recommend running before merge
  • CodeRabbit comments - pending (auto-review on this PR)
  • /oci-defend - pending (next gate)
  • Stack tests:
    • Backend: could NOT run locally (backend deps not installed in the session env). Change is additive + the eval gate is config-excluded via testpaths=tests, so the existing suite is unaffected; CI will run pytest tests/ on this PR.
    • Frontend / MCP: n/a - no changes.

How to test

From backend/ with real OPENAI_API_KEY, PINECONE_API_KEY, and a populated index:

  1. pip install -r requirements-dev.txt
  2. export OCI_EVAL_REPO_ID=<your-indexed-repo-id>
  3. python -m evals --free-only
  4. Inspect the per-query breakdown in evals/results/; if recall is ~0 across all queries, suspect a path-format mismatch first (calibrate labels against returned_files) rather than a bad ranker.
  5. Record the free-tier recall@10 + mrr into baseline.json, set calibrated: true.
  6. pytest evals/ -v - the gate now enforces the baseline.

Expected: a metrics table for both tiers, a timestamped JSON results file, and a reranking delta showing what the pro tier buys.
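
Step 5 above is a manual edit, but it can be pictured as the following small helper. This is only a sketch: `record_baseline` is a hypothetical name, and the `recall@10` / `mrr` keys are assumed field names for the metric fields in baseline.json.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def record_baseline(path: Path, metrics: Dict[str, float]) -> Dict[str, Any]:
    """Write measured free-tier metrics into the baseline file and mark it calibrated."""
    baseline = json.loads(path.read_text())
    baseline["recall@10"] = metrics["recall@10"]  # assumed key name
    baseline["mrr"] = metrics["mrr"]              # assumed key name
    baseline["calibrated"] = True
    path.write_text(json.dumps(baseline, indent=2) + "\n")
    return baseline


# Demo against a throwaway copy of a placeholder baseline (values are made up).
with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "baseline.json"
    baseline_path.write_text(json.dumps({"calibrated": False, "tolerance": 0.05}))
    updated = record_baseline(baseline_path, {"recall@10": 0.7, "mrr": 0.55})
    reloaded = json.loads(baseline_path.read_text())
```

The tolerance field is left untouched, so only the measured numbers and the calibrated flag change.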

Deployment notes

  • No new env vars required for prod (OCI_EVAL_REPO_ID is dev-only for running the eval)
  • No DB migration required
  • No off-limits files modified (auth.py, startup_checks.py, bun.lockb untouched)
  • No mcp package version change
  • No breaking change to /api/v1/* contracts (read-only consumer of search_v2)

Tradeoff annotation (Paired-Ship Protocol)

  • We chose in-process measurement over a live-HTTP endpoint eval because the route cache makes HTTP nondeterministic, and recall/MRR are properties of the ranker, not the transport.
  • We rejected LLM-as-judge relevance because it is nondeterministic and costs tokens per run - it would forfeit the one thing the instrument exists to give: a number you can trust and compare over time.
  • We bought the metric math (ranx) rather than hand-rolling recall/MRR, because that math is a subtle-bug factory and a wrong metric makes the eval confidently wrong (worse than no eval).
  • Consistency tradeoff: strengthened (fully reproducible) vs the HTTP alternative. Operational tradeoff: easier to debug (per-query breakdown), but the v0.1 gate runs only locally against the real index; a network-free CI gate is deferred.
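
For intuition about what the two headline numbers mean, here is a plain-Python sketch of recall@k and MRR on toy data. It is for illustration only: the harness delegates this math to ranx, whose handling of edge cases may differ, and the function names and file names below are made up.

```python
from typing import Dict, List, Set


def recall_at_k(relevant: Set[str], ranked: List[str], k: int) -> float:
    """Fraction of the relevant files that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)


def reciprocal_rank(relevant: Set[str], ranked: List[str]) -> float:
    """1 / position of the first relevant result, or 0.0 if none is returned."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


# Toy labels and rankings; q2 returned nothing and therefore scores zero.
qrels: Dict[str, Set[str]] = {"q1": {"a.py", "b.py"}, "q2": {"c.py"}}
run: Dict[str, List[str]] = {"q1": ["x.py", "a.py", "b.py"], "q2": []}

mean_recall_at_2 = sum(recall_at_k(qrels[q], run[q], 2) for q in qrels) / len(qrels)
mrr = sum(reciprocal_rank(qrels[q], run[q]) for q in qrels) / len(qrels)
```

Here q1 finds one of its two relevant files in the top 2 (recall 0.5) with the first hit at position 2 (reciprocal rank 0.5), and q2 contributes zero to both averages.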

Risk and rollback

  • Symptom: none user-facing. Worst case is an incorrect retrieval number that misleads a maintainer.
  • Blast radius: zero - additive new module, read-only consumer of search_v2, config-excluded from the CI suite.
  • Rollback: revert the PR. No data, no migration, no env.
  • Time-to-rollback: < 5 min.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added a retrieval-quality evaluation framework for measuring search performance offline and deterministically.
    • Supports two evaluation tiers: a free tier using the core ranker and a pro tier with advanced reranking capabilities.
    • Includes a regression detection system that establishes baseline metrics and validates search quality against defined thresholds.
    • Provides CLI tooling to run evaluations and compare performance between tiers.

Offline, deterministic measurement of the live search_v2 ranker: recall@k,
MRR, precision@k over a labeled query set. runner calls indexer.search_v2
in-process (cache off) so the number is repeatable; metrics are computed by
ranx (bought, not hand-rolled -- recall/MRR math is a subtle-bug factory and
a wrong metric is worse than no eval).

Chose in-process over live-HTTP: the route cache makes HTTP nondeterministic,
and recall/MRR are properties of the ranker, not the transport. Cohere
reranking is pro-only, so the CLI records both tiers (free=no-rerank baseline,
pro=Cohere) and prints the delta = what reranking is worth.

ranx and the indexer singleton are imported inside functions, never at module
top, so this never reaches the backend startup path (known-bug #3).
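
The import-inside-function pattern described in this commit message might look roughly like the sketch below. `lazy_import` and `compute_metrics_sketch` are hypothetical names, not the functions in metrics.py; the ranx calls used (`Qrels`, `Run`, `evaluate`) are the library's public entry points.

```python
import importlib
from types import ModuleType
from typing import Any, Dict, List


def lazy_import(module_name: str, install_hint: str) -> ModuleType:
    """Import a module on first use and fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"{module_name!r} is not installed. {install_hint}"
        ) from exc


def compute_metrics_sketch(
    qrels: Dict[str, Dict[str, int]],
    run: Dict[str, Dict[str, float]],
    metrics: List[str],
) -> Any:
    # The import happens inside the function, so importing this module never
    # pulls ranx into the backend startup path.
    ranx = lazy_import("ranx", "Install dev deps: pip install -r requirements-dev.txt")
    return ranx.evaluate(ranx.Qrels(qrels), ranx.Run(run), metrics)
```

Defining `compute_metrics_sketch` costs nothing at import time; the dependency is only touched when the function is actually called.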
…nCodeIntel#312)

pytest gate asserts recall@10 >= baseline - tolerance. Lives under evals/
(not tests/) because pytest.ini testpaths=tests and tests/conftest.py mocks
Pinecone+OpenAI globally, which would force recall to 0; the gate needs the
real index. Skips until baseline.json is calibrated, so it never asserts
against a fake number.

Ground truth: 10 agent-shaped queries labeled blind against OCI's own source.
Empty search results are flagged and counted as misses, not hidden, since
search_v2 swallows errors to [] (FM-3).
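
One way to flag empty results instead of hiding them is sketched below. It assumes ranked file lists per query and a reciprocal-rank score so that higher means better; `build_run` is a hypothetical helper, and how the metric library treats a query with an empty ranking is something to verify rather than assume.

```python
from typing import Dict, List, Tuple


def build_run(
    results: Dict[str, List[str]],
) -> Tuple[Dict[str, Dict[str, float]], List[str]]:
    """Turn ranked file lists into a run dict and report queries that came back empty."""
    run: Dict[str, Dict[str, float]] = {}
    empty: List[str] = []
    for query_id, files in results.items():
        if not files:
            # Keep the query in the run so it is scored as a miss, and record
            # it so the report can show that the result set was empty.
            empty.append(query_id)
        run[query_id] = {path: 1.0 / rank for rank, path in enumerate(files, start=1)}
    return run, empty


run, empty_queries = build_run({"q1": ["a.py", "b.py"], "q2": []})
```

Reporting `empty_queries` alongside the metrics makes it possible to tell a swallowed search error apart from a genuinely poor ranking.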
@vercel

vercel Bot commented Jun 11, 2026


@DevanshuNEU is attempting to deploy a commit to the Dev's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented Jun 11, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 36ff5874-693d-4d45-9346-10527f7dc845

📥 Commits

Reviewing files that changed from the base of the PR and between 1a24f31 and db87f0a.

📒 Files selected for processing (3)
  • backend/evals/__main__.py
  • backend/evals/runner.py
  • backend/evals/test_retrieval_quality.py
📝 Walkthrough

Walkthrough

This PR introduces a complete retrieval-quality evaluation harness for the OCI search_v2 ranker. It includes 10 labeled ground-truth queries, metric computation (recall@k, MRR, precision@k), an offline evaluation runner, a CLI entry point supporting free (core ranker) and pro (Cohere reranking) tiers, and a pytest regression gate against a calibrated baseline.

Changes

Retrieval-Quality Evaluation Harness

| Layer | File(s) | Summary |
|---|---|---|
| Documentation and module initialization | backend/evals/README.md, backend/evals/__init__.py, backend/evals/results/.gitignore | README documents eval harness purpose, tier configuration, query-labeling rules, calibration workflow, and v0.1 limitations; module docstring describes the offline deterministic approach; results directory ignores artifacts while preserving baseline.json. |
| Ground truth queries and baseline metrics | backend/evals/ground_truth/queries.json, backend/evals/baseline.json | 10 labeled agent-shaped queries with expected file targets and explanatory notes; placeholder baseline with calibrated=false, tolerance=0.05, and metric fields set to null pending first real index run. |
| Metric computation library | backend/evals/metrics.py, backend/requirements-dev.txt | compute_metrics function lazily imports ranx and evaluates recall@k, precision@k, and MRR; raises custom ModuleNotFoundError if ranx not installed; dev dependency on ranx>=0.3.20. |
| Evaluation runner and orchestration | backend/evals/runner.py | Loads ground truth and baseline; preflights repos to detect missing vectors; executes queries via live search_v2 with optional reranking; dedupes function-level hits to file-level results preserving best rank; computes aggregate metrics via compute_metrics; writes timestamped JSON with per-query details, metric values, and skip/empty-result counts. |
| CLI interface and result presentation | backend/evals/__main__.py | main() parses mutually-exclusive --free-only and --pro-only flags; runs one or both tiers; formats results table showing tier, reranking flag, query counts, metrics, and optional empty-query notes; computes recall@10 delta when both tiers run; suppresses pro-tier RuntimeError with [PRO TIER SKIPPED] message. |
| Regression test gate | backend/evals/test_retrieval_quality.py | Module-level pytest skipped unless OPENAI_API_KEY and PINECONE_API_KEY are set; loads baseline, skips if not calibrated; runs free-tier run_eval asynchronously; asserts recall@10 meets baseline value minus tolerance. |
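
The mutually-exclusive tier flags summarized for __main__.py can be sketched with argparse as follows. `parse_tiers` is a hypothetical helper for illustration; only the two flag names and the "both tiers by default" behavior come from the PR description.

```python
import argparse
from typing import List, Optional


def parse_tiers(argv: Optional[List[str]] = None) -> List[str]:
    """Map CLI flags to the list of tiers to evaluate (both by default)."""
    parser = argparse.ArgumentParser(prog="python -m evals")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--free-only", action="store_true",
                       help="run only the no-rerank baseline tier")
    group.add_argument("--pro-only", action="store_true",
                       help="run only the reranked tier")
    args = parser.parse_args(argv)
    if args.free_only:
        return ["free"]
    if args.pro_only:
        return ["pro"]
    return ["free", "pro"]
```

Passing both flags makes argparse exit with a usage error, so the CLI never has to decide which of two conflicting requests wins.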

Sequence Diagram

sequenceDiagram
  participant User
  participant CLI as __main__
  participant Runner
  participant SearchAPI as search_v2
  participant Metrics
  participant Disk as Results JSON
  User->>CLI: python -m evals [--free-only|--pro-only]
  CLI->>Runner: run_eval(reranking)
  Runner->>Runner: load_ground_truth()
  Runner->>Runner: _preflight(repo)
  Runner->>SearchAPI: search_v2(query, reranking)
  SearchAPI->>Runner: hits
  Runner->>Runner: _dedupe_files_by_rank()
  Runner->>Metrics: compute_metrics(qrels, run)
  Metrics->>Metrics: ranx.evaluate()
  Metrics->>Runner: metrics dict
  Runner->>Disk: write timestamped results
  Disk->>CLI: output
  CLI->>User: formatted metrics table

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A harness to measure how well we retrieve,
Ground truth and baselines to help us believe,
Free tier and pro tier, both ranked and compared,
The search_v2 ranker, now fairly declared! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main change: adding a retrieval-quality eval harness with specific metrics (recall@10, MRR) for the search_v2 ranker. |
| Linked Issues check | ✅ Passed | The PR implements all v0.1 scope requirements: ground-truth queries (10 provided), runner computing recall@k/MRR/precision@k, reproducible metrics table via CLI, and README documentation. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the eval harness implementation; no external baseline comparisons, public benchmarks, or unrelated modifications present. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
backend/evals/runner.py (2)

26-27: ⚡ Quick win

Make file I/O async.

The path parameter is already typed in the signature, so no type-hint change is needed. The function does perform synchronous file I/O (Path.read_text()), which can block the event loop. According to coding guidelines, "Use async def for any I/O operations. Sync only for pure CPU work." This function is also called from an async context (line 73 in run_eval).

♻️ Refactor to use async file I/O
-def load_ground_truth(path: Path = GROUND_TRUTH_PATH) -> List[dict]:
-    return json.loads(path.read_text())["queries"]
+async def load_ground_truth(path: Path = GROUND_TRUTH_PATH) -> List[dict]:
+    import asyncio
+    content = await asyncio.to_thread(path.read_text)
+    return json.loads(content)["queries"]

Then update the call site at line 73:

-    queries = load_ground_truth()
+    queries = await load_ground_truth()

As per coding guidelines: "Use async def for any I/O operations. Sync only for pure CPU work."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/evals/runner.py` around lines 26 - 27, Convert load_ground_truth into
an async function that performs non-blocking file I/O: change def
load_ground_truth(path: Path = GROUND_TRUTH_PATH) to async def
load_ground_truth(path: Path = GROUND_TRUTH_PATH) and replace the sync
Path.read_text() usage with an async file-read (e.g., aiofiles.open/read or
asyncio-compatible API) to parse JSON and return ["queries"]; then update the
call site in run_eval (the invocation at line 73) to await
load_ground_truth(...) so callers in async contexts use the new coroutine
signature.

Source: Coding guidelines


30-31: ⚡ Quick win

Make file I/O async.

Same issue as load_ground_truth: synchronous file I/O should be async per coding guidelines.

♻️ Refactor to use async file I/O
-def load_baseline(path: Path = BASELINE_PATH) -> dict:
-    return json.loads(path.read_text())
+async def load_baseline(path: Path = BASELINE_PATH) -> dict:
+    import asyncio
+    content = await asyncio.to_thread(path.read_text)
+    return json.loads(content)

Update call sites in test_retrieval_quality.py line 28 accordingly.

As per coding guidelines: "Use async def for any I/O operations. Sync only for pure CPU work."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/evals/runner.py` around lines 30 - 31, Convert load_baseline to an
async function (async def load_baseline(...)) and perform file I/O with an async
reader (e.g., use aiofiles.open or another project-standard async file API) to
read the file content and json.loads it before returning the dict; mirror the
pattern used in load_ground_truth so callers are awaited, and update the call
site in test_retrieval_quality.py (the invocation at line 28) to await
load_baseline.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/evals/__main__.py`:
- Line 14: The parameter type for function _print_table is too broad; change the
annotation of out from dict to a specific type such as Dict[str, Any] and add
the required import from typing (Dict, Any) at the top of the module; update the
_print_table signature to def _print_table(label: str, out: Dict[str, Any]) ->
None: and ensure any callers still satisfy the tighter type contract.

In `@backend/evals/runner.py`:
- Line 141: The synchronous file write (Path.write_text) blocks the event loop
inside the async function; change it to run off the loop using asyncio.to_thread
so the write of json.dumps(out, indent=2) to (RESULTS_DIR /
f"eval_{tier}_{ts}.json") is performed in a thread; ensure asyncio is imported
and replace the direct write_text call with an awaited asyncio.to_thread call
that invokes the Path.write_text method with the JSON string so the async
function remains non-blocking.
- Line 45: The _preflight function's indexer parameter is missing a type hint;
update the signature of async def _preflight(indexer, repo_id: str) -> bool to
include the correct indexer type (e.g., the Indexer class or interface from the
dependencies module) by importing the type or using a forward
reference/typing.TYPE_CHECKING guard to avoid module-level initialization; if
you truly cannot determine the concrete type, annotate indexer as Any but
include an inline comment explaining why per guidelines (e.g., # type: Any --
runtime import cycle prevents precise typing) and ensure the annotation appears
in the function signature for _preflight.
- Line 61: The function signature for run_eval uses a vague return type `dict`;
update it to a more specific type like `Dict[str, Any]` and add the
corresponding import (`from typing import Dict, Any`) at the top of the module;
change the signature to async def run_eval(reranking: bool, repo_id_default: str
= DEFAULT_REPO_ID) -> Dict[str, Any]: and ensure any internal return values
conform to that shape (or refine to a more specific TypedDict/Model if
available).

In `@backend/evals/test_retrieval_quality.py`:
- Line 27: The test function test_recall_at_10_meets_baseline is missing a
return type annotation; update its signature to include an explicit return type
(-> None) so it matches the project's requirement for type hints on every
function signature—modify the def test_recall_at_10_meets_baseline(...) to def
test_recall_at_10_meets_baseline(...) -> None without changing the body.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 221fb2ad-206e-4924-88c5-7859cf15f752

📥 Commits

Reviewing files that changed from the base of the PR and between ca451f0 and 1a24f31.

📒 Files selected for processing (10)
  • backend/evals/README.md
  • backend/evals/__init__.py
  • backend/evals/__main__.py
  • backend/evals/baseline.json
  • backend/evals/ground_truth/queries.json
  • backend/evals/metrics.py
  • backend/evals/results/.gitignore
  • backend/evals/runner.py
  • backend/evals/test_retrieval_quality.py
  • backend/requirements-dev.txt

Comment thread backend/evals/__main__.py Outdated
Comment thread backend/evals/runner.py Outdated
Comment thread backend/evals/runner.py Outdated
Comment thread backend/evals/runner.py Outdated
Comment thread backend/evals/test_retrieval_quality.py Outdated
…OpenCodeIntel#312)

- _print_table / run_eval: dict -> Dict[str, Any] (heterogeneous result dict;
  Any is intentional and commented per the backend no-bare-Any rule)
- _preflight: type the indexer param via a TYPE_CHECKING guard so the precise
  type costs nothing at runtime (preserves known-bug #3 import isolation)
- test: add -> None return annotation
- results write: run Path.write_text off the event loop via asyncio.to_thread
  so run_eval stays non-blocking (backend async rule)

Skipped the nitpick converting load_ground_truth/load_baseline to async aiofiles:
one-time sub-ms config reads in an offline batch CLI with no loop concurrency;
not worth a new runtime dep + cascading async into a sync test.
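
The results-write change in this commit (running Path.write_text off the event loop) can be sketched as below. `write_results` and the timestamp value are hypothetical; the file-name pattern `eval_{tier}_{ts}.json` follows the review comment, and `asyncio.to_thread` requires Python 3.9 or newer.

```python
import asyncio
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


async def write_results(results_dir: Path, tier: str, ts: str, out: Dict[str, Any]) -> Path:
    """Serialize results and write them in a worker thread, not on the event loop."""
    target = results_dir / f"eval_{tier}_{ts}.json"
    payload = json.dumps(out, indent=2)
    # Path.write_text is blocking; to_thread keeps the coroutine non-blocking.
    await asyncio.to_thread(target.write_text, payload)
    return target


# Demo with a throwaway directory and a made-up timestamp.
with tempfile.TemporaryDirectory() as tmp:
    written = asyncio.run(
        write_results(Path(tmp), "free", "20260611T053500Z", {"recall@10": None})
    )
    round_trip = json.loads(written.read_text())
```

This keeps the async rule satisfied without adding a new runtime dependency, which is the same reasoning the commit gives for not adopting aiofiles for the two config reads.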
@vercel

vercel Bot commented Jun 11, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| opencodeintel | Ignored | Preview | Jun 11, 2026 5:35am |

@DevanshuNEU DevanshuNEU merged commit 022ef14 into OpenCodeIntel:main Jun 11, 2026
8 checks passed
DevanshuNEU added a commit that referenced this pull request Jun 11, 2026
Railway assigns a dynamic $PORT and runs its healthcheck against it. The
Dockerfile hardcoded --port 8000, so every deploy after healthcheckPath was
added to railway.json failed with "service unavailable" on /health. #316 and
#318 have both been stuck for days; prod only survived on the pre-healthcheck
#293 image until that replica was knocked out. Bind ${PORT:-8000} (fallback for
local/compose) and make the internal healthcheck read the same port.
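
The fix in that commit lives in the Dockerfile as the shell expansion ${PORT:-8000}. For readers more at home in Python, the same fallback rule can be sketched like this; `resolve_port` is a hypothetical helper and not part of the referenced commit.

```python
import os
from typing import Mapping


def resolve_port(env: Mapping[str, str] = os.environ, default: int = 8000) -> int:
    """Mirror ${PORT:-8000}: use PORT when set and non-empty, else the default."""
    raw = env.get("PORT") or ""
    return int(raw) if raw.strip() else default
```

Like the shell form, an empty PORT is treated the same as an unset one, so local and compose runs fall back to 8000 while a platform-assigned port is honored.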

Development

Successfully merging this pull request may close these issues.

feat: retrieval-quality eval harness (recall@10, MRR) on a labeled query set

1 participant