feat: retrieval-quality eval harness (recall@10, MRR) for the live search_v2 ranker #318
Offline, deterministic measurement of the live search_v2 ranker: recall@k, MRR, precision@k over a labeled query set. runner calls indexer.search_v2 in-process (cache off) so the number is repeatable; metrics are computed by ranx (bought, not hand-rolled -- recall/MRR math is a subtle-bug factory and a wrong metric is worse than no eval). Chose in-process over live-HTTP: the route cache makes HTTP nondeterministic, and recall/MRR are properties of the ranker, not the transport. Cohere reranking is pro-only, so the CLI records both tiers (free=no-rerank baseline, pro=Cohere) and prints the delta = what reranking is worth. ranx and the indexer singleton are imported inside functions, never at module top, so this never reaches the backend startup path (known-bug #3).
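To make the reported numbers concrete, here is a minimal reference sketch of what recall@k, precision@k, and reciprocal rank measure for one query. It is illustration only: the harness delegates the real computation to ranx, and the function names and file paths below are hypothetical. The sketch assumes the ranked list is already deduplicated and divides precision by k even when fewer than k results came back.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labeled-relevant files that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots occupied by relevant files (assumes no duplicates)."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result; 0.0 when nothing relevant is returned."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """MRR: the reciprocal rank averaged over all queries."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0
```

An empty result list scores 0.0 on every metric here, so a query whose search returned nothing counts as a miss rather than disappearing from the average.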
…nCodeIntel#312) pytest gate asserts recall@10 >= baseline - tolerance. Lives under evals/ (not tests/) because pytest.ini testpaths=tests and tests/conftest.py mocks Pinecone+OpenAI globally, which would force recall to 0; the gate needs the real index. Skips until baseline.json is calibrated, so it never asserts against a fake number. Ground truth: 10 agent-shaped queries labeled blind against OCI's own source. Empty search results are flagged and counted as misses, not hidden, since search_v2 swallows errors to [] (FM-3).
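The gate's decision can be sketched as a pure function, which keeps the skip-until-calibrated rule easy to reason about. This is a hedged sketch, not the test in the PR: `gate_decision` is a hypothetical name, and the `tolerance` key in the baseline dict is an assumption (the source only confirms `calibrated` and a recorded `recall@10`).

```python
from typing import Any, Dict


def gate_decision(baseline: Dict[str, Any], measured_recall_at_10: float) -> str:
    """Return "skip", "pass", or "fail" for one eval run.

    Assumed baseline shape (hypothetical apart from `calibrated`):
        {"calibrated": bool, "recall@10": float, "tolerance": float}
    """
    # An uncalibrated baseline holds a placeholder, so never assert against it.
    if not baseline.get("calibrated", False):
        return "skip"
    # The gate passes when recall@10 >= baseline - tolerance.
    floor = baseline["recall@10"] - baseline.get("tolerance", 0.0)
    return "pass" if measured_recall_at_10 >= floor else "fail"
```

In a pytest test, the "skip" branch would typically map to `pytest.skip(...)` and the "fail" branch to a failed assertion.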
📝 Walkthrough
This PR introduces a complete retrieval-quality evaluation harness for the OCI
Changes
Retrieval-Quality Evaluation Harness
Sequence Diagram
sequenceDiagram
participant User
participant CLI as __main__
participant Runner
participant SearchAPI as search_v2
participant Metrics
participant Disk as Results JSON
User->>CLI: python -m evals [--free-only|--pro-only]
CLI->>Runner: run_eval(reranking)
Runner->>Runner: load_ground_truth()
Runner->>Runner: _preflight(repo)
Runner->>SearchAPI: search_v2(query, reranking)
SearchAPI->>Runner: hits
Runner->>Runner: _dedupe_files_by_rank()
Runner->>Metrics: compute_metrics(qrels, run)
Metrics->>Metrics: ranx.evaluate()
Metrics->>Runner: metrics dict
Runner->>Disk: write timestamped results
Disk->>CLI: output
CLI->>User: formatted metrics table
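The `_dedupe_files_by_rank()` step in the diagram collapses function-level hits to file level while keeping each file's best rank. A minimal sketch of that idea follows, assuming hits arrive best-first as dicts with a `file_path` key; both the hit shape and the key name are assumptions, not the actual `search_v2` response schema.

```python
from typing import Any, Dict, List


def dedupe_files_by_rank(hits: List[Dict[str, Any]], path_key: str = "file_path") -> List[str]:
    """Collapse function-level hits into a ranked list of unique files.

    Hits are assumed to be ordered best-first, so keeping the first
    occurrence of each file keeps that file's best rank.
    """
    seen = set()
    files: List[str] = []
    for hit in hits:
        path = hit.get(path_key)
        if not path or path in seen:
            continue  # skip malformed hits and lower-ranked repeats of a file
        seen.add(path)
        files.append(path)
    return files
```

Deduplicating before scoring matters because the labels are files: two functions from the same file would otherwise occupy two of the top-k slots and distort precision@k.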
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 5
🧹 Nitpick comments (2)
backend/evals/runner.py (2)
26-27: ⚡ Quick win
Add type hints for `path` parameter and make file I/O async.

The `path` parameter is already typed in the signature, but the function performs synchronous file I/O (`Path.read_text()`) which can block. According to coding guidelines, "Use `async def` for any I/O operations. Sync only for pure CPU work." Additionally, this function is called from async contexts (line 73 in `run_eval`).

♻️ Refactor to use async file I/O

```diff
-def load_ground_truth(path: Path = GROUND_TRUTH_PATH) -> List[dict]:
-    return json.loads(path.read_text())["queries"]
+async def load_ground_truth(path: Path = GROUND_TRUTH_PATH) -> List[dict]:
+    import asyncio
+    content = await asyncio.to_thread(path.read_text)
+    return json.loads(content)["queries"]
```

Then update the call site at line 73:

```diff
-    queries = load_ground_truth()
+    queries = await load_ground_truth()
```

As per coding guidelines: "Use `async def` for any I/O operations. Sync only for pure CPU work."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/evals/runner.py` around lines 26 - 27, Convert load_ground_truth into an async function that performs non-blocking file I/O: change def load_ground_truth(path: Path = GROUND_TRUTH_PATH) to async def load_ground_truth(path: Path = GROUND_TRUTH_PATH) and replace the sync Path.read_text() usage with an async file-read (e.g., aiofiles.open/read or asyncio-compatible API) to parse JSON and return ["queries"]; then update the call site in run_eval (the invocation at line 73) to await load_ground_truth(...) so callers in async contexts use the new coroutine signature.
Source: Coding guidelines

30-31: ⚡ Quick win
Make file I/O async.

Same issue as `load_ground_truth`: synchronous file I/O should be async per coding guidelines.

♻️ Refactor to use async file I/O

```diff
-def load_baseline(path: Path = BASELINE_PATH) -> dict:
-    return json.loads(path.read_text())
+async def load_baseline(path: Path = BASELINE_PATH) -> dict:
+    import asyncio
+    content = await asyncio.to_thread(path.read_text)
+    return json.loads(content)
```

Update call sites in `test_retrieval_quality.py` line 28 accordingly.

As per coding guidelines: "Use `async def` for any I/O operations. Sync only for pure CPU work."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/evals/runner.py` around lines 30 - 31, Convert load_baseline to an async function (async def load_baseline(...)) and perform file I/O with an async reader (e.g., use aiofiles.open or another project-standard async file API) to read the file content and json.loads it before returning the dict; mirror the pattern used in load_ground_truth so callers are awaited, and update the call site in test_retrieval_quality.py (the invocation at line 28) to await load_baseline.
Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/evals/__main__.py`:
- Line 14: The parameter type for function _print_table is too broad; change the
annotation of out from dict to a specific type such as Dict[str, Any] and add
the required import from typing (Dict, Any) at the top of the module; update the
_print_table signature to def _print_table(label: str, out: Dict[str, Any]) ->
None: and ensure any callers still satisfy the tighter type contract.
In `@backend/evals/runner.py`:
- Line 141: The synchronous file write (Path.write_text) blocks the event loop
inside the async function; change it to run off the loop using asyncio.to_thread
so the write of json.dumps(out, indent=2) to (RESULTS_DIR /
f"eval_{tier}_{ts}.json") is performed in a thread; ensure asyncio is imported
and replace the direct write_text call with an awaited asyncio.to_thread call
that invokes the Path.write_text method with the JSON string so the async
function remains non-blocking.
- Line 45: The _preflight function's indexer parameter is missing a type hint;
update the signature of async def _preflight(indexer, repo_id: str) -> bool to
include the correct indexer type (e.g., the Indexer class or interface from the
dependencies module) by importing the type or using a forward
reference/typing.TYPE_CHECKING guard to avoid module-level initialization; if
you truly cannot determine the concrete type, annotate indexer as Any but
include an inline comment explaining why per guidelines (e.g., # type: Any --
runtime import cycle prevents precise typing) and ensure the annotation appears
in the function signature for _preflight.
- Line 61: The function signature for run_eval uses a vague return type `dict`;
update it to a more specific type like `Dict[str, Any]` and add the
corresponding import (`from typing import Dict, Any`) at the top of the module;
change the signature to async def run_eval(reranking: bool, repo_id_default: str
= DEFAULT_REPO_ID) -> Dict[str, Any]: and ensure any internal return values
conform to that shape (or refine to a more specific TypedDict/Model if
available).
In `@backend/evals/test_retrieval_quality.py`:
- Line 27: The test function test_recall_at_10_meets_baseline is missing a
return type annotation; update its signature to include an explicit return type
(-> None) so it matches the project's requirement for type hints on every
function signature—modify the def test_recall_at_10_meets_baseline(...) to def
test_recall_at_10_meets_baseline(...) -> None without changing the body.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 221fb2ad-206e-4924-88c5-7859cf15f752
📒 Files selected for processing (10)
- backend/evals/README.md
- backend/evals/__init__.py
- backend/evals/__main__.py
- backend/evals/baseline.json
- backend/evals/ground_truth/queries.json
- backend/evals/metrics.py
- backend/evals/results/.gitignore
- backend/evals/runner.py
- backend/evals/test_retrieval_quality.py
- backend/requirements-dev.txt
…OpenCodeIntel#312)
- _print_table / run_eval: dict -> Dict[str, Any] (heterogeneous result dict; Any is intentional and commented per the backend no-bare-Any rule)
- _preflight: type the indexer param via a TYPE_CHECKING guard so the precise type costs nothing at runtime (preserves known-bug #3 import isolation)
- test: add -> None return annotation
- results write: run Path.write_text off the event loop via asyncio.to_thread so run_eval stays non-blocking (backend async rule)

Skipped the nitpick converting load_ground_truth/load_baseline to async aiofiles: one-time sub-ms config reads in an offline batch CLI with no loop concurrency; not worth a new runtime dep + cascading async into a sync test.
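Two of the fixes above can be illustrated together: the `TYPE_CHECKING` guard and the off-loop results write. This is a sketch under stated assumptions, not the code in `runner.py`: the `services.indexer` import path, the `Indexer` class name, and the helper names are hypothetical; only the `eval_{tier}_{ts}.json` filename pattern and the use of `asyncio.to_thread` come from the review thread. It needs Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
import json
from pathlib import Path
from typing import TYPE_CHECKING, Any, Dict

if TYPE_CHECKING:
    # Hypothetical import path. This branch is read by type checkers only and
    # never executes at runtime, so the indexer module is not imported here.
    from services.indexer import Indexer


def indexer_label(indexer: "Indexer") -> str:
    """The quoted annotation gives a precise type with no runtime import."""
    return type(indexer).__name__


async def write_results(results_dir: Path, tier: str, ts: str, out: Dict[str, Any]) -> Path:
    """Write the results JSON in a worker thread so the event loop is not blocked."""
    results_dir.mkdir(parents=True, exist_ok=True)
    target = results_dir / f"eval_{tier}_{ts}.json"
    # Path.write_text is synchronous; asyncio.to_thread runs it off the loop.
    await asyncio.to_thread(target.write_text, json.dumps(out, indent=2))
    return target
```

Because the guarded import never runs, a module written this way can still be imported when the indexer package is unavailable, which is the isolation property the commit message is protecting.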
Railway assigns a dynamic $PORT and runs its healthcheck against it. The Dockerfile hardcoded --port 8000, so every deploy after healthcheckPath was added to railway.json failed with "service unavailable" on /health. #316 and #318 have both been stuck for days; prod only survived on the pre-healthcheck #293 image until that replica was knocked out. Bind ${PORT:-8000} (fallback for local/compose) and make the internal healthcheck read the same port.
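The actual fix lives in the Dockerfile's shell expansion, `${PORT:-8000}`. As a rough Python analogue of the same rule (use `PORT` when it is set and non-empty, otherwise fall back to 8000), the following sketch may help when reasoning about what the server and the healthcheck should both resolve to. The `resolve_port` helper is hypothetical and not part of the PR.

```python
import os
from typing import Mapping, Optional


def resolve_port(env: Optional[Mapping[str, str]] = None, default: int = 8000) -> int:
    """Mirror ${PORT:-8000}: fall back when PORT is unset or empty."""
    source = os.environ if env is None else env
    raw = source.get("PORT", "")
    return int(raw) if raw else default
```

Resolving the port in one place, and having both the server bind and the healthcheck read from it, avoids the mismatch described above where the platform probed one port while the process listened on another.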
Summary
Adds an offline, deterministic harness that measures the live `search_v2` ranker: recall@5/10, MRR, precision@k over a human-labeled query set. This is OCI's first retrieval-quality instrument, so search changes (rerankers, embedding swaps, the v2 to v3 cutover) can finally be regression-gated instead of guessed at.

Closes / implements
- `oci/decisions/2026-06-09-retrieval-quality-eval-harness-v0.1.md` (Accepted, Option A; includes a 2026-06-10 implementation-note addendum for reality-driven deviations)
- `feat/retrieval-eval-harness-v0.1`

What changed
Backend (`backend/...`):
- `backend/evals/runner.py` - the engine: fail-closed pre-flight per repo, in-process `indexer.search_v2` per query (cache off), function-level hits collapsed to file-level keeping best rank, ranx qrels/run build, per-query breakdown + timestamped JSON results.
- `backend/evals/metrics.py` - thin `ranx` wrapper (recall@5/10, precision@k, MRR); ranx lazy-imported with an actionable error (known-bug #3 isolation).
- `backend/evals/__main__.py` - `python -m evals`; default records BOTH tiers (free = no-rerank baseline, pro = Cohere) and prints the delta.
- `backend/requirements-dev.txt` - new dev-only file pinning `ranx`; prod `requirements.txt` and `config/startup_checks.py` untouched.

Tests / docs:
- `backend/evals/test_retrieval_quality.py` - pytest regression gate (recall@10 >= baseline - tolerance); lives under `evals/` so `pytest.ini testpaths=tests` + the global Pinecone/OpenAI mocks never touch it.
- `backend/evals/ground_truth/queries.json` - 10 agent-shaped queries labeled blind against OCI's own source.
- `backend/evals/baseline.json` - ships `calibrated: false`; gate skips until a real number is recorded.
- `backend/evals/README.md`, `backend/evals/results/.gitignore`.

ADR adherence
- `queries.json`, converted to ranx in code - noted in ADR addendum)
- `latency-budget.md`); full-run wall-clock target only

Pipeline
- `/oci-pipeline` - NOT run this session (went straight to PR by request); recommend running before merge
- `/oci-defend` - pending (next gate)
- `testpaths=tests`, so the existing suite is unaffected; CI will run `pytest tests/` on this PR.

How to test
From `backend/` with real `OPENAI_API_KEY`, `PINECONE_API_KEY`, and a populated index:
- `pip install -r requirements-dev.txt`
- `export OCI_EVAL_REPO_ID=<your-indexed-repo-id>`
- `python -m evals --free-only`
- `evals/results/`; if recall is ~0 across all queries, it's a path-format mismatch (calibrate labels against `returned_files`), not a bad ranker.
- `recall@10` + `mrr` into `baseline.json`, set `calibrated: true`.
- `pytest evals/ -v` - the gate now enforces the baseline.

Expected: a metrics table for both tiers, a timestamped JSON results file, and a reranking delta showing what the pro tier buys.
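The reranking delta mentioned above is a per-metric difference between the two tiers. A minimal sketch follows, assuming each tier's metrics are a flat dict keyed by metric name (for example `recall@10` and `mrr`); the function name, dict shape, and rounding are assumptions rather than the CLI's actual output format.

```python
from typing import Dict


def reranking_delta(free: Dict[str, float], pro: Dict[str, float]) -> Dict[str, float]:
    """Per-metric gain of the pro tier (rerank on) over the free tier (rerank off).

    Only metrics present in both runs are compared. A positive value means
    reranking helped on that metric; a negative value means it hurt.
    """
    shared = sorted(set(free) & set(pro))
    return {name: round(pro[name] - free[name], 4) for name in shared}
```

Comparing only shared metric names keeps the delta meaningful when one tier was skipped or a run was recorded with a different metric list.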
Deployment notes
- `OCI_EVAL_REPO_ID` is dev-only for running the eval)
- `auth.py`, `startup_checks.py`, `bun.lockb` untouched)
- `mcp` package version change
- `/api/v1/*` contracts (read-only consumer of `search_v2`)

Tradeoff annotation (Paired-Ship Protocol)
- `ranx`) rather than hand-rolling recall/MRR, because that math is a subtle-bug factory and a wrong metric makes the eval confidently wrong (worse than no eval).

Risk and rollback
- `search_v2`, config-excluded from the CI suite.

Summary by CodeRabbit
Release Notes