fix(eval): block self-judging rows for llm_judge #344

Open
ulises-jeremias wants to merge 2 commits into IBM:main from ulises-jeremias:fix/eval-self-judge-guard

ulises-jeremias commented Jun 3, 2026

Description

Prevent self-judging in offline evaluation when llm_judge is active.

This change adds a guard that rejects any evaluated row whose trajectory model matches --judge-model. The comparison normalizes the litellm_proxy/ prefix first, so equivalent model IDs are treated as the same model. This prevents the candidate model from judging its own output in the default evaluation path.

Type of Change

  • Bug fix

Impact on Benchmarking

  • No change to baselines: This change enforces evaluator safety checks and does not modify scoring logic itself.

Related Issues

Verification Steps

  1. uvx ruff check src/evaluation/evaluator.py src/evaluation/runner.py src/evaluation/cli.py src/evaluation/tests/test_evaluator.py
  2. uv run pytest src/evaluation/tests

Checklist

  • I have added tests that prove my fix is effective.
  • My code follows the project's Ruff formatting and linting rules.
  • I have signed off my commits (DCO).

Notes

  • Wired judge_model into Evaluator and evaluation.runner.evaluate(...) so the same guard applies to CLI and programmatic evaluation.
  • Added focused evaluator tests for exact match, normalized model-ID match, and non-LLM scorer bypass.
  • Updated evaluation docs and instructions to document the self-judging guard behavior.
  • Updated evaluation examples so default claude-agent candidate runs are no longer paired with the same judge model in docs.

Additional Quick Notes

  • Scope is intentionally narrow and isolated to evaluation-path safety checks; no runtime agent behavior or scoring rubric logic was changed.
  • The guard only runs when the active scorer is llm_judge and a judge_model is provided.
  • Comparison normalizes litellm_proxy/ to avoid false negatives with equivalent model IDs.
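
For reviewers who want the behavior in one place, here is a minimal sketch of the guard as described in the notes above. It is not the code in this PR: the names `normalize_model_id`, `check_self_judging`, and `SelfJudgingError`, the `exact_match` scorer name, and the `model-a` / `model-b` model IDs are all hypothetical and used only for illustration.

```python
from typing import Optional

# Prefix named in the PR description; everything else below is an assumption.
LITELLM_PREFIX = "litellm_proxy/"


class SelfJudgingError(ValueError):
    """Hypothetical error raised when a row's model is also the judge model."""


def normalize_model_id(model_id: str) -> str:
    """Drop a leading litellm_proxy/ prefix so equivalent IDs compare equal."""
    if model_id.startswith(LITELLM_PREFIX):
        return model_id[len(LITELLM_PREFIX):]
    return model_id


def check_self_judging(
    scorer: str, trajectory_model: str, judge_model: Optional[str]
) -> None:
    """Reject a row when the trajectory model would judge its own output."""
    # The guard only applies to the llm_judge scorer with a judge model set.
    if scorer != "llm_judge" or not judge_model:
        return
    if normalize_model_id(trajectory_model) == normalize_model_id(judge_model):
        raise SelfJudgingError(
            f"trajectory model {trajectory_model!r} matches judge model "
            f"{judge_model!r}; choose a different --judge-model"
        )


# Different models pass; a non-LLM scorer bypasses the check entirely.
check_self_judging("llm_judge", "model-a", "model-b")
check_self_judging("exact_match", "model-a", "model-a")
```

Under these assumptions, calling `check_self_judging("llm_judge", "litellm_proxy/model-a", "model-a")` raises `SelfJudgingError`, which corresponds to the normalized model-ID match case listed in the tests.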

Signed-off-by: ulises-jeremias <ulisescf.24@gmail.com>
Signed-off-by: ulises-jeremias <ulisescf.24@gmail.com>
Development

Successfully merging this pull request may close these issues.

Default agent and judge models are identical — out-of-the-box self-evaluation