fix: normalize indexed repo paths in eval runner, calibrate baseline #321

Merged
DevanshuNEU merged 1 commit into OpenCodeIntel:main from DevanshuNEU:fix/eval-path-normalization-calibration
Jun 12, 2026

DevanshuNEU commented Jun 12, 2026 (Collaborator)

What

The retrieval-quality eval harness (#312) scored recall 0 across every query on its first calibration run. Root cause was a path-format mismatch, not a bad ranker:

  • Index stores: repos/<repo_id>/<path> (e.g. repos/78aa181e-.../backend/middleware/auth.py)
  • Ground-truth labels: repo-root-relative (backend/middleware/auth.py)

The two never string-matched, so ranx computed 0 recall even though search_v2 was returning the right files at rank 1.
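
To make the failure mode concrete, here is a minimal sketch of set-based recall@k in plain Python. It is not the ranx call the harness uses; the function name `recall_at_k` and the placeholder repo id are assumptions made for illustration only.

```python
def recall_at_k(ranked_paths, expected_paths, k=10):
    """Fraction of expected paths that appear in the top-k ranked paths."""
    expected = set(expected_paths)
    if not expected:
        return 0.0
    hits = expected & set(ranked_paths[:k])
    return len(hits) / len(expected)


# Hypothetical repo id, used only for illustration.
REPO_ID = "example-repo-id"

# Ground-truth label: repo-root-relative.
expected = ["backend/middleware/auth.py"]
# What the index returns: storage-prefixed.
ranked = [f"repos/{REPO_ID}/backend/middleware/auth.py"]

# The right file is at rank 1, but the strings differ, so nothing matches.
mismatch_recall = recall_at_k(ranked, expected)
```

With the prefix left in place, `mismatch_recall` is 0.0 even though the correct file is ranked first, which is the symptom described above.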

Fix

Strip the exact repos/<repo_id>/ storage prefix in the runner before matching (_to_repo_relative). The match is exact on repo_id, so a path that merely starts with repos/ is never over-trimmed. queries.json stays repo-root-relative and portable across repo_ids, which honors the OCI_EVAL_REPO_ID override design.
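
A minimal sketch of what such a helper could look like. The body of _to_repo_relative is not shown on this page, so the implementation below is an assumption (the review comment only indicates that it relies on str.startswith), and the repo ids in the usage lines are hypothetical.

```python
def to_repo_relative(file_path: str, repo_id: str) -> str:
    """Strip the exact ``repos/<repo_id>/`` storage prefix, if present.

    Paths under a different repo id, or paths that only start with
    ``repos/``, are returned unchanged.
    """
    prefix = f"repos/{repo_id}/"
    if file_path.startswith(prefix):
        return file_path[len(prefix):]
    return file_path


# Hypothetical repo ids, for illustration only.
stripped = to_repo_relative("repos/abc/backend/middleware/auth.py", "abc")
other_repo = to_repo_relative("repos/abcd/backend/middleware/auth.py", "abc")
```

Including the trailing slash in the prefix is what keeps `repos/abcd/...` from being trimmed when the repo id is `abc`.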

Calibration result

Ran against the OCI repo index over the 10-query ground-truth set:

| Tier | recall@10 | MRR |
|---|---|---|
| Free (core ranker) | 0.80 | 0.80 |
| Pro (Cohere rerank) | 0.85 | 0.658 |

baseline.json now has the free-tier numbers with calibrated: true, so pytest evals/ -v asserts (recall@10 >= 0.75) instead of skipping. Verified passing locally.
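
The assert-or-skip behavior can be sketched as below. This is an illustration, not the harness's test code: the function name `recall_gate` and the nesting of the baseline dictionary are assumptions, and the 0.05 tolerance is inferred from the 0.80 baseline and the 0.75 threshold rather than read from baseline.json.

```python
def recall_gate(metrics, baseline, tier="free_core", tolerance=0.05):
    """Return None when the baseline is uncalibrated (the test would skip),
    otherwise True or False for pass or fail on recall@10."""
    if not baseline.get("calibrated"):
        return None
    floor = baseline[tier]["recall@10"] - tolerance
    return metrics["recall@10"] >= floor


# Assumed layout; only the key names come from the review summary.
baseline = {
    "calibrated": True,
    "free_core": {"recall@10": 0.80, "mrr": 0.80},
}

passes = recall_gate({"recall@10": 0.80}, baseline)
regressed = recall_gate({"recall@10": 0.60}, baseline)
skipped = recall_gate({"recall@10": 0.80}, {"calibrated": False})
```

In a pytest test, the `None` case would map to `pytest.skip(...)` and the boolean to a plain `assert`.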

Notes

  • Free-tier (deterministic, no Cohere) is the CI regression baseline, per the harness README.
  • Reranking is not strictly better on this set: +0.05 recall@10 but MRR drops 0.80 -> 0.658 (pulls one extra expected file into top-10 while demoting rank-1 hits). Worth a follow-up look at the pro-tier default.
  • Known ranker misses: q06 (durable repo-state) and q10 (path-filtering) — tracked separately, not addressed here.
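
The reranking trade-off noted above can be reproduced on a toy query. The file names and rankings below are invented for illustration and are not taken from the eval set; the sketch only shows how recall@10 can rise while the reciprocal rank (the per-query term that MRR averages) falls.

```python
def reciprocal_rank(ranked, expected):
    """1 / rank of the first expected path, or 0.0 if none is ranked."""
    expected = set(expected)
    for rank, path in enumerate(ranked, start=1):
        if path in expected:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, expected, k=10):
    """Fraction of expected paths found in the top-k ranked paths."""
    expected = set(expected)
    if not expected:
        return 0.0
    return len(expected & set(ranked[:k])) / len(expected)


expected = ["a.py", "b.py"]

# Core ranker: first hit at rank 1, but b.py never reaches the top 10.
core = ["a.py"] + [f"other_{i}.py" for i in range(9)]
# Reranked: both expected files in the top 10, but the first hit drops to rank 2.
reranked = ["other_0.py", "a.py", "b.py"] + [f"other_{i}.py" for i in range(1, 8)]

core_scores = (recall_at_k(core, expected), reciprocal_rank(core, expected))
reranked_scores = (recall_at_k(reranked, expected), reciprocal_rank(reranked, expected))
```

Here the core ranking scores (0.5, 1.0) and the reranked one scores (1.0, 0.5): better coverage, worse first-hit position.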

Test plan

  • python -m evals returns real numbers (free + pro tiers)
  • pytest evals/ -v passes against the calibrated baseline

Summary by CodeRabbit

  • Bug Fixes

    • Fixed file path matching in the evaluation system to correctly handle repository-relative paths across different repository contexts.
  • Chores

    • Updated evaluation baseline with calibrated metric values to improve evaluation accuracy and provide regression detection thresholds.

search_v2 returns file_path as repos/<repo_id>/<path> but ground-truth
labels are repo-root-relative, so the first calibration run scored
recall 0 across every query. Strip the exact repos/<repo_id>/ prefix in
the runner (keeps queries.json portable across repo_ids) and record the
real free-tier baseline: recall@10 0.80, mrr 0.80. pytest gate now
asserts instead of skipping.

vercel Bot commented Jun 12, 2026

@DevanshuNEU is attempting to deploy a commit to the Dev's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai Bot commented Jun 12, 2026

Walkthrough

This PR establishes calibrated baseline metrics for evaluation and implements repo-aware path normalization to align indexed file paths with expected ground-truth paths. The baseline transitions from placeholder nulls to concrete metric values, while the runner adds a helper function and updates deduplication logic to normalize paths consistently.

Changes

Evaluation calibration and path normalization

| Layer / File(s) | Summary |
|---|---|
| Baseline calibration (backend/evals/baseline.json) | Baseline metrics updated from uncalibrated placeholders to concrete calibrated values for free_core and pro_reranked (including recall@10 and mrr). The calibrated flag is set to true and the note field documents calibration date, index context, evaluation set size, CI regression tolerance, and observed metric tradeoffs. |
| Repo-aware path normalization (backend/evals/runner.py) | New _to_repo_relative() function strips the repos/<repo_id>/ storage prefix from indexed file paths. Updated _dedupe_files_by_rank() signature to accept repo_id and normalize paths before deduplication. The run_eval() call site passes repo_id into deduplication to ensure ranked paths align with ground-truth expected paths. |
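
The normalize-then-dedupe step described in the summary might look roughly like this. It is a sketch under stated assumptions: the result shape (a best-first list of dicts with a `file_path` key, possibly several chunks per file) and the function body are guesses, since runner.py itself is not shown here.

```python
def dedupe_files_by_rank(results, repo_id):
    """Collapse ranked results to unique repo-relative file paths.

    Each path is normalized first, then kept only at its best (first) rank.
    """
    prefix = f"repos/{repo_id}/"
    seen = set()
    ranked_files = []
    for result in results:
        path = result["file_path"]
        if path.startswith(prefix):
            path = path[len(prefix):]  # strip the storage prefix
        if path not in seen:
            seen.add(path)
            ranked_files.append(path)
    return ranked_files


# Hypothetical search results: two chunks from the same file, then another file.
results = [
    {"file_path": "repos/abc/backend/middleware/auth.py"},
    {"file_path": "repos/abc/backend/middleware/auth.py"},
    {"file_path": "repos/abc/backend/evals/runner.py"},
]
ranked_files = dedupe_files_by_rank(results, "abc")
```

Normalizing before deduplication means the list handed to the metric already uses the same path format as the ground-truth labels.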

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A rabbit hops through metrics bright,
Where baselines glow with calibrated light,
Path normalization clears the way—
Evaluation takes its truest play! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes both key changes: path normalization in the eval runner and baseline calibration. It directly addresses the main objectives of fixing path-format mismatches and establishing calibrated evaluation metrics. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/evals/runner.py`:
- Around line 38-44: The _to_repo_relative function currently uses
str.startswith which violates Bug `#5`; change it to use pathlib.Path.parts tuple
comparison: convert file_path to a Path, build prefix_parts = ("repos",
repo_id), check whether path.parts starts with that tuple (compare
path.parts[:len(prefix_parts)] == prefix_parts) and, if so, return the remaining
path reconstructed from the trailing parts (using Path(...) or joining the
parts) otherwise return the original path string; ensure you import pathlib.Path
and preserve behavior (exact repo_id match, no over-trimming).
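
The parts-based comparison suggested in this comment could be sketched as follows. This is an illustration, not the merged code: it uses `PurePosixPath` rather than `pathlib.Path` so that the forward-slash index paths split the same way on every platform, and the function name and repo ids are hypothetical.

```python
from pathlib import PurePosixPath


def to_repo_relative_parts(file_path: str, repo_id: str) -> str:
    """Strip a leading ("repos", repo_id) pair of path components, if present."""
    parts = PurePosixPath(file_path).parts
    prefix_parts = ("repos", repo_id)
    has_prefix = parts[:len(prefix_parts)] == prefix_parts
    # Require at least one trailing component so the result is never empty.
    if has_prefix and len(parts) > len(prefix_parts):
        return str(PurePosixPath(*parts[len(prefix_parts):]))
    return file_path


# Hypothetical repo ids, for illustration only.
stripped = to_repo_relative_parts("repos/abc/backend/middleware/auth.py", "abc")
untouched = to_repo_relative_parts("repos/abcd/backend/middleware/auth.py", "abc")
```

Comparing whole components makes the exact-match rule explicit: `abcd` can never be mistaken for `abc`, independent of where the slashes fall.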

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: de115166-8a47-46d7-99d0-90b7df28bf9d

📥 Commits

Reviewing files that changed from the base of the PR and between 1c5a935 and 640907d.

📒 Files selected for processing (2)
  • backend/evals/baseline.json
  • backend/evals/runner.py

Comment thread backend/evals/runner.py

vercel Bot commented Jun 12, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| opencodeintel | Ignored | Preview | Jun 12, 2026 4:37pm |

@DevanshuNEU DevanshuNEU merged commit 42d6b1c into OpenCodeIntel:main Jun 12, 2026
8 checks passed