fix: normalize indexed repo paths in eval runner, calibrate baseline #321
search_v2 returns file_path as repos/<repo_id>/<path> but ground-truth labels are repo-root-relative, so the first calibration run scored recall 0 across every query. Strip the exact repos/<repo_id>/ prefix in the runner (keeps queries.json portable across repo_ids) and record the real free-tier baseline: recall@10 0.80, mrr 0.80. pytest gate now asserts instead of skipping.
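The mismatch and the fix can be sketched in a few lines. This is an illustration only: `to_repo_relative` and `recall_at_k` are stand-in names rather than the code in `backend/evals/runner.py` (the PR's helper is `_to_repo_relative`, and the harness computes metrics with ranx), and the repo id is hypothetical.

```python
def to_repo_relative(file_path: str, repo_id: str) -> str:
    """Strip the exact repos/<repo_id>/ prefix; leave any other path untouched."""
    prefix = f"repos/{repo_id}/"
    if file_path.startswith(prefix):
        return file_path[len(prefix):]
    return file_path


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of relevant paths that appear in the top-k retrieved paths."""
    if not relevant:
        return 0.0
    hits = relevant.intersection(retrieved[:k])
    return len(hits) / len(relevant)


repo_id = "example-repo-id"  # hypothetical id, for illustration only
raw_hits = [
    f"repos/{repo_id}/backend/middleware/auth.py",
    f"repos/{repo_id}/backend/evals/runner.py",
]
ground_truth = {"backend/middleware/auth.py"}  # repo-root-relative label

# Raw storage paths never equal the labels, so recall is 0.
before = recall_at_k(raw_hits, ground_truth)
# After normalization the rank-1 hit matches its label.
after = recall_at_k([to_repo_relative(p, repo_id) for p in raw_hits], ground_truth)
print(before, after)  # 0.0 1.0
```

Because the prefix includes the full `repo_id` and the trailing slash, a path stored under a different repo id is returned unchanged rather than trimmed.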
📝 Walkthrough
This PR establishes calibrated baseline metrics for evaluation and implements repo-aware path normalization to align indexed file paths with expected ground-truth paths. The baseline transitions from placeholder nulls to concrete metric values, while the runner adds a helper function and updates deduplication logic to normalize paths consistently.
Changes: Evaluation calibration and path normalization
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/evals/runner.py`:
- Around line 38-44: The _to_repo_relative function currently uses
str.startswith which violates Bug `#5`; change it to use pathlib.Path.parts tuple
comparison: convert file_path to a Path, build prefix_parts = ("repos",
repo_id), check whether path.parts starts with that tuple (compare
path.parts[:len(prefix_parts)] == prefix_parts) and, if so, return the remaining
path reconstructed from the trailing parts (using Path(...) or joining the
parts) otherwise return the original path string; ensure you import pathlib.Path
and preserve behavior (exact repo_id match, no over-trimming).
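The parts-based comparison described in this comment might look like the sketch below. It is not the committed fix: the function name is hypothetical, and it uses `PurePosixPath` instead of `pathlib.Path` on the assumption that indexed paths always use forward slashes, so the result does not depend on the host OS.

```python
from pathlib import PurePosixPath


def to_repo_relative_parts(file_path: str, repo_id: str) -> str:
    """Drop a leading ("repos", repo_id) pair of path components, if present."""
    prefix_parts = ("repos", repo_id)
    parts = PurePosixPath(file_path).parts
    has_prefix = parts[:len(prefix_parts)] == prefix_parts
    # Require at least one trailing component so the bare prefix is not
    # rewritten to ".".
    if has_prefix and len(parts) > len(prefix_parts):
        return str(PurePosixPath(*parts[len(prefix_parts):]))
    return file_path


print(to_repo_relative_parts("repos/abc/backend/evals/runner.py", "abc"))
# backend/evals/runner.py
print(to_repo_relative_parts("repos/abcdef/backend/evals/runner.py", "abc"))
# repos/abcdef/backend/evals/runner.py
```

Comparing whole components keeps the exact `repo_id` match: `repos/abcdef/...` is left alone when the id is `abc`, and paths without the prefix come back as the original string.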
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: de115166-8a47-46d7-99d0-90b7df28bf9d
📒 Files selected for processing (2)
- backend/evals/baseline.json
- backend/evals/runner.py
What
The retrieval-quality eval harness (#312) scored recall 0 across every query on its first calibration run. Root cause was a path-format mismatch, not a bad ranker:
- `search_v2` returns `file_path` as `repos/<repo_id>/<path>` (e.g. `repos/78aa181e-.../backend/middleware/auth.py`)
- ground-truth labels are repo-root-relative (e.g. `backend/middleware/auth.py`)
The two never string-matched, so `ranx` computed 0 recall even though `search_v2` was returning the right files at rank 1.
Fix
Strip the exact `repos/<repo_id>/` storage prefix in the runner before matching (`_to_repo_relative`). The match is exact on `repo_id`, so a path that merely starts with `repos/` is never over-trimmed. `queries.json` stays repo-root-relative and portable across repo_ids (honors the `OCI_EVAL_REPO_ID` override design).
Calibration result
Ran against the OCI repo index over the 10-query ground-truth set:
- free tier: recall@10 0.80, mrr 0.80
`baseline.json` now has the free-tier numbers with `calibrated: true`, so `pytest evals/ -v` asserts (recall@10 >= 0.75) instead of skipping. Verified passing locally.
Notes
Test plan
- `python -m evals` returns real numbers (free + pro tiers)
- `pytest evals/ -v` passes against the calibrated baseline
Summary by CodeRabbit
Bug Fixes
Chores