fix(eval): block self-judging rows for llm_judge by ulises-jeremias · Pull Request #344 · IBM/AssetOpsBench

ulises-jeremias · 2026-06-03T05:29:51Z

Description

Prevent self-judging in offline evaluation when llm_judge is active.

This change adds a guard that rejects any evaluated row whose trajectory model matches --judge-model, after normalizing the litellm_proxy/ prefix on both sides. This keeps a candidate model from being scored by itself in the default evaluation path.
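The comparison described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper names `normalize_model_id` and `is_self_judging` are hypothetical, and the only normalization assumed is stripping a leading `litellm_proxy/` prefix.

```python
LITELLM_PREFIX = "litellm_proxy/"  # prefix named in the PR description


def normalize_model_id(model_id: str) -> str:
    """Drop a leading litellm_proxy/ prefix so equivalent IDs compare equal."""
    # str.removeprefix requires Python 3.9+; it is a no-op if the prefix is absent.
    return model_id.removeprefix(LITELLM_PREFIX)


def is_self_judging(trajectory_model: str, judge_model: str) -> bool:
    """Return True when the candidate and the judge resolve to the same model ID."""
    return normalize_model_id(trajectory_model) == normalize_model_id(judge_model)
```

With this sketch, `is_self_judging("litellm_proxy/model-a", "model-a")` is true, while two different model IDs compare as distinct. The model IDs here are placeholders.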

Type of Change

Bug fix

Impact on Benchmarking

No change to baselines: This change enforces evaluator safety checks and does not modify scoring logic itself.

Related Issues

Fixes: Default agent and judge models are identical — out-of-the-box self-evaluation #336

Verification Steps

uvx ruff check src/evaluation/evaluator.py src/evaluation/runner.py src/evaluation/cli.py src/evaluation/tests/test_evaluator.py
uv run pytest src/evaluation/tests

Checklist

I have added tests that prove my fix is effective.
My code follows the project's Ruff formatting and linting rules.
I have signed off my commits (DCO).

Notes

Wired judge_model into Evaluator and evaluation.runner.evaluate(...) so the same guard applies to CLI and programmatic evaluation.
Added focused evaluator tests for exact match, normalized model-ID match, and non-LLM scorer bypass.
Updated evaluation docs and instructions to document the self-judging guard behavior.
Updated the evaluation examples in the docs so that default claude-agent candidate runs are no longer paired with an identical judge model.

Additional Quick Notes

Scope is intentionally narrow and isolated to evaluation-path safety checks; no runtime agent behavior or scoring rubric logic was changed.
The guard only runs when the active scorer is llm_judge and a judge_model is provided.
The comparison strips the litellm_proxy/ prefix before matching, so equivalent model IDs written with and without the prefix are still detected (no false negatives).
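The activation conditions in the notes above could look roughly like this. It is a sketch under stated assumptions: the function name `check_row`, its signature, and the choice to raise `ValueError` are hypothetical, since the PR description only says that matching rows are rejected.

```python
from typing import Optional

LITELLM_PREFIX = "litellm_proxy/"  # prefix named in the PR description


def _normalize(model_id: str) -> str:
    # Hypothetical helper: strip the proxy prefix if present (Python 3.9+).
    return model_id.removeprefix(LITELLM_PREFIX)


def check_row(row_model: str, scorer: str, judge_model: Optional[str]) -> None:
    """Raise ValueError if an llm_judge run would score a row with its own model."""
    # The guard is inactive for non-LLM scorers or when no judge model is given.
    if scorer != "llm_judge" or not judge_model:
        return
    if _normalize(row_model) == _normalize(judge_model):
        raise ValueError(
            f"self-judging row: trajectory model {row_model!r} "
            f"matches judge model {judge_model!r}"
        )
```

Under these assumptions, a row produced by `model-a` passes when the scorer is not `llm_judge` or when the judge is a different model, and is rejected when the judge is `model-a` or `litellm_proxy/model-a`. This mirrors the three test cases listed in the notes: exact match, normalized match, and non-LLM scorer bypass.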

Signed-off-by: ulises-jeremias <ulisescf.24@gmail.com>

ulises-jeremias added 2 commits June 3, 2026 02:28

fix(eval): block self-judging rows for llm_judge

ab910f5

Signed-off-by: ulises-jeremias <ulisescf.24@gmail.com>

docs(eval): avoid self-judge model in examples

fbe1a65

Signed-off-by: ulises-jeremias <ulisescf.24@gmail.com>

DhavalRepo18 requested review from DhavalRepo18 and ShuxinLin June 3, 2026 10:50

DhavalRepo18 added the External contribution label Jun 4, 2026

ulises-jeremias wants to merge 2 commits into IBM:main from ulises-jeremias:fix/eval-self-judge-guard


2 participants
