
Unify error handling across all evaluators#82

Merged
Ki-Seki merged 3 commits into main from copilot/fix-errored-items-handling
Feb 11, 2026

Conversation

Contributor

Copilot AI commented Feb 10, 2026

Evaluators handled errored items inconsistently: MCQA manually computed corrects / (evaluates - errors), PPL relied on -1 sentinels, Match used inline filtering, and CV built filtered lists by hand. This created a maintenance burden and left room for divergent behavior.

Changes

Base evaluator class (base.py):

  • Added _filter_non_error_items() helper for consistent error filtering
  • Updated _safe_average() to exclude errored items by default (via exclude_errors=True)
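
A minimal sketch of what the two base-class helpers could look like. The helper names and the exclude_errors default come from the PR description; the method bodies, the key parameter, and the Item type are assumptions, since the diff itself is not shown here:

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical item shape; an empty error_msg means the item evaluated cleanly.
    score: float
    error_msg: str = ""


class BaseEvaluator:
    @staticmethod
    def _filter_non_error_items(items):
        """Keep only items whose error_msg is falsy (i.e. no error recorded)."""
        return [item for item in items if not item.error_msg]

    @classmethod
    def _safe_average(cls, items, key, exclude_errors=True):
        """Average key(item) over items, returning 0.0 for an empty pool.

        With exclude_errors=True (the new default), errored items are
        dropped before averaging.
        """
        if exclude_errors:
            items = cls._filter_non_error_items(items)
        if not items:
            return 0.0
        return sum(key(item) for item in items) / len(items)
```

With this shape, an evaluator that previously filtered inline can call `self._filter_non_error_items(evaled_items)` once and reuse the result for every metric.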

All evaluators (mcqa, ppl, match, cv):

  • Use centralized _filter_non_error_items() instead of ad-hoc filtering
  • Standardized error messages to str(e) (PPL was using repr(e))
  • Simplified calibrated accuracy calculations
# Before (inconsistent approaches)
non_error_items = [item for item in evaled_items if not item.error_msg]  # manual
valid_tags = sum(item.num_tags for item in evaled_items if not item.error_msg)  # inline
calibrated_accuracy = corrects / (evaluates - errors)  # manual math

# After (unified)
non_error_items = self._filter_non_error_items(evaled_items)
valid_tags = sum(item.num_tags for item in non_error_items)
calibrated_accuracy = corrects / len(non_error_items)
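
The str(e) standardization matters because repr(e) wraps the message in the exception type and quoting, so PPL's stored error messages differed in shape from the other evaluators'. A quick illustration (the example message is hypothetical):

```python
# str(e) yields just the message; repr(e) adds the type and quoting.
e = ValueError("context length exceeded")

print(str(e))   # context length exceeded
print(repr(e))  # ValueError('context length exceeded')
```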

Changing the _safe_average() default (exclude_errors=True) codifies existing implicit behavior: no evaluator was computing averages over errored items.
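
The calibrated-accuracy simplification can be sanity-checked with hypothetical counts: dividing by the filtered list's length is arithmetically identical to the old manual corrects / (evaluates - errors):

```python
# Hypothetical tallies from one run: 10 items evaluated, 2 errored, 6 correct.
evaluates, errors, corrects = 10, 2, 6

# Old MCQA-style manual arithmetic.
old_calibrated = corrects / (evaluates - errors)

# Unified form: divide by the length of the filtered (non-errored) list.
non_error_items = ["item"] * (evaluates - errors)  # stand-in for the filtered list
new_calibrated = corrects / len(non_error_items)

assert old_calibrated == new_calibrated == 0.75
```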

Original prompt

This section describes the original issue to resolve

<issue_title>fix: unify the handling of errored items across all evaluators</issue_title>
<issue_description></issue_description>

Comments on the Issue (you are @copilot in this section)



Co-authored-by: Ki-Seki <60967965+Ki-Seki@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix unification of errored item handling across evaluators" to "Unify error handling across all evaluators" Feb 10, 2026
Copilot AI requested a review from Ki-Seki February 10, 2026 00:47
@Ki-Seki Ki-Seki marked this pull request as ready for review February 11, 2026 02:48
Copilot AI review requested due to automatic review settings February 11, 2026 02:48
Contributor

Copilot AI left a comment


Pull request overview

This PR unifies error handling across all evaluators (MCQA, PPL, Match, CV) by introducing a centralized filtering helper and standardizing error-related calculations. Previously, each evaluator handled errored items with its own ad-hoc approach (manual filtering, inline list comprehensions, manual arithmetic), creating a maintenance burden and room for inconsistency.

Changes:

  • Added _filter_non_error_items() helper method to BaseEvaluator for consistent error filtering
  • Updated _safe_average() to exclude errored items by default via new exclude_errors parameter
  • Refactored all evaluators to use the centralized helper instead of custom filtering logic

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

  • src/gimbench/base.py: Added the _filter_non_error_items() static helper method and updated _safe_average() with an exclude_errors parameter defaulting to True
  • src/gimbench/mcqa/evaluators.py: Replaced manual arithmetic (evaluates - errors) with len(non_error_items) in the calibrated accuracy calculation
  • src/gimbench/match/evaluators.py: Replaced inline filtering expressions with the centralized _filter_non_error_items() helper for metrics computation
  • src/gimbench/cv/evaluators.py: Replaced a manual list comprehension with the centralized _filter_non_error_items() helper


@Ki-Seki Ki-Seki merged commit 3bc233d into main Feb 11, 2026
9 checks passed
@Ki-Seki Ki-Seki deleted the copilot/fix-errored-items-handling branch February 11, 2026 03:24

Development

Successfully merging this pull request may close these issues.

fix: unify the handling of errored items across all evaluators

2 participants