Unify error handling across all evaluators by Copilot · Pull Request #82 · SculptAI/GIMBench

Copilot · 2026-02-10T00:38:37Z

Evaluators handled errored items inconsistently: MCQA manually calculated corrects / (evaluates - errors), PPL relied on -1 sentinels, Match used inline filtering, and CV created manual filtered lists. This created maintenance burden and potential for divergent behavior.

Changes

Base evaluator class (base.py):

Added _filter_non_error_items() helper for consistent error filtering
Updated _safe_average() to exclude errored items by default (via exclude_errors=True)
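A minimal sketch of what such a filtering helper might look like. It assumes each evaluated item carries an `error_msg` attribute that is empty when evaluation succeeded, as the list comprehensions in this PR suggest. The `EvalItem` dataclass and the method body are hypothetical stand-ins, not the code in base.py.

```python
from dataclasses import dataclass


@dataclass
class EvalItem:
    """Hypothetical stand-in for an evaluated item; only error_msg matters here."""
    score: float = 0.0
    error_msg: str = ""


class BaseEvaluator:
    """Illustrative base class, not the actual GIMBench implementation."""

    @staticmethod
    def _filter_non_error_items(items):
        # Keep only items whose error_msg is empty (evaluation succeeded).
        return [item for item in items if not item.error_msg]


items = [EvalItem(1.0), EvalItem(0.0, error_msg="timeout"), EvalItem(0.5)]
kept = BaseEvaluator._filter_non_error_items(items)
print(len(kept))  # 2
```

Keeping the predicate in one place means a later change to how errors are flagged would only touch the base class.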

All evaluators (mcqa, ppl, match, cv):

Use centralized _filter_non_error_items() instead of ad-hoc filtering
Standardized error messages to str(e) (PPL was using repr(e))
Simplified calibrated accuracy calculations
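To illustrate why standardizing on str(e) matters, the snippet below shows how the two representations differ for the same exception. The exception type and message are made up for the example.

```python
# Capture both representations of one (made-up) exception.
try:
    raise ValueError("empty completion")
except Exception as e:
    as_str = str(e)    # message only
    as_repr = repr(e)  # exception type plus quoted message

print(as_str)   # empty completion
print(as_repr)  # ValueError('empty completion')
```

With mixed usage, the same failure would show up under two different strings in stored results, which makes grouping errors across evaluators harder.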

# Before (inconsistent approaches)
non_error_items = [item for item in evaled_items if not item.error_msg]  # manual
valid_tags = sum(item.num_tags for item in evaled_items if not item.error_msg)  # inline
calibrated_accuracy = corrects / (evaluates - errors)  # manual math

# After (unified)
non_error_items = self._filter_non_error_items(evaled_items)
valid_tags = sum(item.num_tags for item in non_error_items)
calibrated_accuracy = corrects / len(non_error_items)
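As a worked check, the two calibrated accuracy formulas agree whenever `errors` counts exactly the items with a non-empty `error_msg`. The `Item` class and the sample data below are hypothetical and only serve to make the arithmetic concrete.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical evaluated item for this illustration."""
    correct: bool = False
    error_msg: str = ""


evaled_items = [
    Item(correct=True),
    Item(correct=False),
    Item(error_msg="parse failure"),
    Item(correct=True),
]

evaluates = len(evaled_items)                               # 4
errors = sum(1 for item in evaled_items if item.error_msg)  # 1
corrects = sum(1 for item in evaled_items if item.correct and not item.error_msg)  # 2

non_error_items = [item for item in evaled_items if not item.error_msg]

before = corrects / (evaluates - errors)  # manual math: 2 / 3
after = corrects / len(non_error_items)   # unified form: 2 / 3
print(before == after)  # True
```

Neither form guards against the case where every item errored; whether the real evaluators handle that case is not shown in this description.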

The _safe_average() default behavior change (exclude_errors=True) codifies existing implicit behavior—no evaluator was computing averages over errored items.
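The effect of the default can be sketched as follows. The real signature of `_safe_average()` is not shown in this PR description, so the function below, its `attr` parameter, and the 0.0 fallback for an empty input are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical evaluated item; ppl of -1.0 mimics an error sentinel."""
    ppl: float = 0.0
    error_msg: str = ""


def safe_average(items, attr, exclude_errors=True):
    """Hypothetical sketch: mean of `attr` over items, 0.0 if nothing remains."""
    if exclude_errors:
        items = [item for item in items if not item.error_msg]
    values = [getattr(item, attr) for item in items]
    return sum(values) / len(values) if values else 0.0


items = [Item(ppl=2.0), Item(ppl=-1.0, error_msg="model error"), Item(ppl=4.0)]

default_avg = safe_average(items, "ppl")                        # (2 + 4) / 2
with_errors = safe_average(items, "ppl", exclude_errors=False)  # (2 - 1 + 4) / 3
print(default_avg)  # 3.0
```

Including the errored item drags the mean down with its sentinel value, which is the kind of skew the default is meant to rule out.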

Original prompt

This section details the original issue you should resolve

<issue_title>fix: unify the handling of errored items across all evaluators</issue_title>
<issue_description></issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes fix: unify the handling of errored items across all evaluators #54

Co-authored-by: Ki-Seki <60967965+Ki-Seki@users.noreply.github.com>

Copilot

Pull request overview

This PR unifies error handling across all evaluators (MCQA, PPL, Match, CV) by introducing a centralized filtering helper and standardizing error-related calculations. Previously, each evaluator handled errored items differently using ad-hoc approaches (manual filtering, inline list comprehensions, manual arithmetic), creating maintenance burden and potential for inconsistency.

Changes:

Added _filter_non_error_items() helper method to BaseEvaluator for consistent error filtering
Updated _safe_average() to exclude errored items by default via new exclude_errors parameter
Refactored all evaluators to use the centralized helper instead of custom filtering logic

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/gimbench/base.py | Added `_filter_non_error_items()` static helper method and updated `_safe_average()` with `exclude_errors` parameter defaulting to True |
| src/gimbench/mcqa/evaluators.py | Replaced manual arithmetic `(evaluates - errors)` with `len(non_error_items)` for calibrated accuracy calculation |
| src/gimbench/match/evaluators.py | Replaced inline filtering expressions with centralized `_filter_non_error_items()` helper for metrics computation |
| src/gimbench/cv/evaluators.py | Replaced manual list comprehension with centralized `_filter_non_error_items()` helper |

Initial plan

b9d9d3a

Copilot AI assigned Copilot and Ki-Seki Feb 10, 2026

Copilot started work on behalf of Ki-Seki February 10, 2026 00:39

Unify error handling across all evaluators

75684c6

Co-authored-by: Ki-Seki <60967965+Ki-Seki@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix unification of errored item handling across evaluators~~ Unify error handling across all evaluators Feb 10, 2026

Copilot AI requested a review from Ki-Seki February 10, 2026 00:47

Copilot finished work on behalf of Ki-Seki February 10, 2026 00:47

fix: improve error message representation in PPLEvaluator

ae59d92

Ki-Seki marked this pull request as ready for review February 11, 2026 02:48

Copilot AI review requested due to automatic review settings February 11, 2026 02:48

Copilot started reviewing on behalf of Ki-Seki February 11, 2026 02:49

Copilot AI reviewed Feb 11, 2026

Ki-Seki approved these changes Feb 11, 2026

Ki-Seki merged commit 3bc233d into main Feb 11, 2026
9 checks passed

Ki-Seki deleted the copilot/fix-errored-items-handling branch February 11, 2026 03:24

Ki-Seki merged 3 commits into main from copilot/fix-errored-items-handling
