fix: normalize indexed repo paths in eval runner, calibrate baseline by DevanshuNEU · Pull Request #321 · OpenCodeIntel/opencodeintel

DevanshuNEU · 2026-06-12T16:21:01Z

What

The retrieval-quality eval harness (#312) scored recall 0 across every query on its first calibration run. Root cause was a path-format mismatch, not a bad ranker:

- Index stores: `repos/<repo_id>/<path>` (e.g. `repos/78aa181e-.../backend/middleware/auth.py`)
- Ground-truth labels: repo-root-relative (`backend/middleware/auth.py`)

The two never string-matched, so ranx computed 0 recall even though search_v2 was returning the right files at rank 1.

Fix

The runner now strips the exact repos/<repo_id>/ storage prefix before matching (_to_repo_relative). The match is on the exact repo_id, so a path that merely starts with repos/ is never over-trimmed. queries.json stays repo-root-relative and portable across repo_ids, which honors the OCI_EVAL_REPO_ID override design.
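For readers without the diff open, the prefix strip described above might look roughly like the sketch below. The real `_to_repo_relative` in runner.py is not reproduced in this thread, so the function name, signature, and the repo id `abc` are illustrative assumptions.

```python
def to_repo_relative(file_path: str, repo_id: str) -> str:
    """Strip the exact 'repos/<repo_id>/' storage prefix, if present.

    Hypothetical stand-in for the runner's _to_repo_relative helper.
    """
    # The trailing slash matters: it stops repo_id "abc" from matching
    # a path stored under "repos/abcd/...".
    prefix = f"repos/{repo_id}/"
    if file_path.startswith(prefix):
        return file_path[len(prefix):]
    # Paths under another repo, or already relative, pass through unchanged.
    return file_path


indexed = "repos/abc/backend/middleware/auth.py"
print(to_repo_relative(indexed, "abc"))    # backend/middleware/auth.py
print(to_repo_relative(indexed, "other"))  # unchanged
```

Because only the indexed side is normalized, the labels in queries.json need no repo_id baked into them.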

Calibration result

Ran against the OCI repo index over the 10-query ground-truth set:

Tier	recall@10	MRR
Free (core ranker)	0.80	0.80
Pro (Cohere rerank)	0.85	0.658

baseline.json now has the free-tier numbers with calibrated: true, so pytest evals/ -v asserts (recall@10 >= 0.75) instead of skipping. Verified passing locally.
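A minimal sketch of how a calibrated baseline can turn a skipped check into an asserting one. The key names (`calibrated`, `free_core`, `recall@10`, `mrr`) follow the CodeRabbit walkthrough of baseline.json; the 0.05 tolerance is an assumption inferred from the 0.80 baseline and the 0.75 gate, and the helper names are hypothetical rather than the harness's own.

```python
import json

# Assumed shape of baseline.json, reduced to the free tier.
BASELINE_JSON = '{"calibrated": true, "free_core": {"recall@10": 0.80, "mrr": 0.80}}'
TOLERANCE = 0.05  # assumption: 0.80 baseline - 0.05 = 0.75 gate


def recall_floor(baseline: dict, tier: str = "free_core", tolerance: float = TOLERANCE):
    """Return the minimum acceptable recall@10, or None if uncalibrated."""
    if not baseline.get("calibrated"):
        return None
    # Round to avoid floating-point noise in the subtraction.
    return round(baseline[tier]["recall@10"] - tolerance, 4)


def check_recall(measured: float, baseline: dict) -> str:
    """Classify a measured recall@10 as 'skip', 'pass', or 'fail'."""
    floor = recall_floor(baseline)
    if floor is None:
        return "skip"  # a pytest-based harness would presumably call pytest.skip here
    return "pass" if measured >= floor else "fail"


baseline = json.loads(BASELINE_JSON)
print(recall_floor(baseline))        # 0.75
print(check_recall(0.80, baseline))  # pass
print(check_recall(0.70, baseline))  # fail
```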

Notes

- Free-tier (deterministic, no Cohere) is the CI regression baseline, per the harness README.
- Reranking is not strictly better on this set: +0.05 recall@10 but MRR drops 0.80 -> 0.658 (pulls one extra expected file into top-10 while demoting rank-1 hits). Worth a follow-up look at the pro-tier default.
- Known ranker misses: q06 (durable repo-state) and q10 (path-filtering). Both are tracked separately and not addressed here.
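The recall/MRR tradeoff in the second note can be reproduced on a toy query. The sketch below uses the standard definitions of recall@k and reciprocal rank rather than ranx's API, and the file names are invented; it is not data from the 10-query set.

```python
def recall_at_k(ranked: list, expected: set, k: int = 10) -> float:
    """Fraction of expected files that appear in the top-k results."""
    hits = set(ranked[:k]) & expected
    return len(hits) / len(expected)


def reciprocal_rank(ranked: list, expected: set) -> float:
    """1 / rank of the first expected file, or 0.0 if none is found."""
    for rank, path in enumerate(ranked, start=1):
        if path in expected:
            return 1.0 / rank
    return 0.0


expected = {"a.py", "b.py"}            # toy ground truth
core = ["a.py", "x.py", "y.py"]        # right file at rank 1, b.py missing
reranked = ["x.py", "a.py", "b.py"]    # b.py recovered, a.py demoted

print(recall_at_k(core, expected), reciprocal_rank(core, expected))          # 0.5 1.0
print(recall_at_k(reranked, expected), reciprocal_rank(reranked, expected))  # 1.0 0.5
```

Recall rewards finding every expected file anywhere in the top 10, while MRR only looks at the first hit, so a reranker can improve one and hurt the other on the same query.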

Test plan

python -m evals returns real numbers (free + pro tiers)
pytest evals/ -v passes against the calibrated baseline

Summary by CodeRabbit

Bug Fixes
- Fixed file path matching in the evaluation system to correctly handle repository-relative paths across different repository contexts.
Chores
- Updated evaluation baseline with calibrated metric values to improve evaluation accuracy and provide regression detection thresholds.

search_v2 returns file_path as repos/<repo_id>/<path> but ground-truth labels are repo-root-relative, so the first calibration run scored recall 0 across every query. Strip the exact repos/<repo_id>/ prefix in the runner (keeps queries.json portable across repo_ids) and record the real free-tier baseline: recall@10 0.80, mrr 0.80. pytest gate now asserts instead of skipping.

vercel · 2026-06-12T16:21:06Z

@DevanshuNEU is attempting to deploy a commit to the Dev's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai · 2026-06-12T16:21:15Z

Walkthrough

This PR establishes calibrated baseline metrics for evaluation and implements repo-aware path normalization to align indexed file paths with expected ground-truth paths. The baseline transitions from placeholder nulls to concrete metric values, while the runner adds a helper function and updates deduplication logic to normalize paths consistently.

Changes

Evaluation calibration and path normalization

Layer / File(s)	Summary
Baseline calibration `backend/evals/baseline.json`	Baseline metrics updated from uncalibrated placeholders to concrete calibrated values for `free_core` and `pro_reranked` (including `recall@10` and `mrr`). The `calibrated` flag is set to `true` and the `note` field documents calibration date, index context, evaluation set size, CI regression tolerance, and observed metric tradeoffs.
Repo-aware path normalization `backend/evals/runner.py`	New `_to_repo_relative()` function strips the `repos/<repo_id>/` storage prefix from indexed file paths. Updated `_dedupe_files_by_rank()` signature to accept `repo_id` and normalize paths before deduplication. The `run_eval()` call site passes `repo_id` into deduplication to ensure ranked paths align with ground-truth expected paths.
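To illustrate why normalization has to happen before deduplication, here is a rough sketch of a rank-preserving dedupe over search results. It assumes each result is a dict with a `file_path` key (the commit message says search_v2 returns `file_path`); the actual `_dedupe_files_by_rank()` body is not shown in this thread, and the sample results are invented.

```python
def dedupe_files_by_rank(results: list, repo_id: str) -> list:
    """Return repo-relative file paths in rank order, first occurrence only.

    Hypothetical stand-in for the runner's _dedupe_files_by_rank.
    """
    prefix = f"repos/{repo_id}/"
    seen = set()
    ranked = []
    for result in results:  # results are assumed to be sorted best-first
        path = result["file_path"]
        if path.startswith(prefix):
            path = path[len(prefix):]  # normalize before comparing
        if path not in seen:
            seen.add(path)
            ranked.append(path)
    return ranked


# Two chunks from the same file collapse into one ranked entry.
results = [
    {"file_path": "repos/abc/backend/middleware/auth.py"},
    {"file_path": "repos/abc/backend/middleware/auth.py"},
    {"file_path": "repos/abc/backend/evals/runner.py"},
]
print(dedupe_files_by_rank(results, "abc"))
# ['backend/middleware/auth.py', 'backend/evals/runner.py']
```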

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit hops through metrics bright,
Where baselines glow with calibrated light,
Path normalization clears the way—
Evaluation takes its truest play! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes both key changes: path normalization in the eval runner and baseline calibration. It directly addresses the main objectives of fixing path-format mismatches and establishing calibrated evaluation metrics.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/evals/runner.py`:
- Around line 38-44: The _to_repo_relative function currently uses
str.startswith which violates Bug `#5`; change it to use pathlib.Path.parts tuple
comparison: convert file_path to a Path, build prefix_parts = ("repos",
repo_id), check whether path.parts starts with that tuple (compare
path.parts[:len(prefix_parts)] == prefix_parts) and, if so, return the remaining
path reconstructed from the trailing parts (using Path(...) or joining the
parts) otherwise return the original path string; ensure you import pathlib.Path
and preserve behavior (exact repo_id match, no over-trimming).
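The parts-based comparison suggested in the comment above could be sketched as follows. This version uses `pathlib.PurePosixPath` instead of `Path`, on the assumption that stored index paths always use forward slashes regardless of host OS; the function name and the repo id `abc` are hypothetical.

```python
from pathlib import PurePosixPath


def to_repo_relative_parts(file_path: str, repo_id: str) -> str:
    """Strip a leading ('repos', repo_id) pair of path components, if present."""
    parts = PurePosixPath(file_path).parts
    prefix_parts = ("repos", repo_id)
    n = len(prefix_parts)
    # Require at least one component after the prefix so that a bare
    # 'repos/<repo_id>' is returned untouched rather than becoming '.'.
    if len(parts) > n and parts[:n] == prefix_parts:
        return str(PurePosixPath(*parts[n:]))
    return file_path


print(to_repo_relative_parts("repos/abc/backend/middleware/auth.py", "abc"))
# backend/middleware/auth.py
print(to_repo_relative_parts("repos/abcd/x.py", "abc"))
# repos/abcd/x.py
```

Comparing whole components means a repo_id can never match a longer directory name by accident, without relying on a trailing slash in a string prefix.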


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: de115166-8a47-46d7-99d0-90b7df28bf9d

📥 Commits

Reviewing files that changed from the base of the PR and between 1c5a935 and 640907d.

📒 Files selected for processing (2)

backend/evals/baseline.json
backend/evals/runner.py

vercel · 2026-06-12T16:37:40Z

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
opencodeintel	Ignored	Preview	Jun 12, 2026 4:37pm

coderabbitai Bot reviewed Jun 12, 2026


DevanshuNEU mentioned this pull request Jun 12, 2026

search: Cohere reranking (pro tier) regresses MRR on the retrieval eval #322 (Open)

DevanshuNEU merged commit 42d6b1c into OpenCodeIntel:main Jun 12, 2026
8 checks passed

DevanshuNEU merged 1 commit into OpenCodeIntel:main from DevanshuNEU:fix/eval-path-normalization-calibration
