feat: add GEditBench benchmark with task type subsets by davidberenstein1957 · Pull Request #518 · PrunaAI/pruna

davidberenstein1957 · 2026-01-31T16:05:41Z

Closes #511

Summary

- Add GEditBench benchmark for image editing evaluation with 11 task type subsets
- Fetch data from HuggingFace (stepfun-ai/GEdit-Bench), filter to English only
- Support subset filtering: background_change, color_alter, material_alter, motion_change, ps_human, style_change, subject_add, subject_remove, subject_replace, text_change, tone_transfer

Usage

from pruna.data import PrunaDataModule

# Load all task types
dm = PrunaDataModule.from_string("GEditBench")

# Load specific task type
dm = PrunaDataModule.from_string("GEditBench", subset="background_change")

Test plan

- PrunaDataModule.from_string("GEditBench") works
- Subset filter works for all 11 task types
- Auxiliaries include subset field
- Docstring tests pass

…mpts benchmark
- Introduced `from_benchmark` method in `PrunaDataModule` to create instances from benchmark classes.
- Added `Benchmark`, `BenchmarkEntry`, and `BenchmarkRegistry` classes for managing benchmarks.
- Implemented `PartiPrompts` benchmark for text-to-image generation with various categories and challenges.
- Created utility function `benchmark_to_datasets` to convert benchmarks into datasets compatible with `PrunaDataModule`.
- Added integration tests for benchmark functionality and data module interactions.

…filtering
- Remove heavy benchmark abstraction (Benchmark class, registry, adapter, 24 subclasses)
- Extend setup_parti_prompts_dataset with category and num_samples params
- Add BenchmarkInfo dataclass for metadata (metrics, description, subsets)
- Switch PartiPrompts to prompt_with_auxiliaries_collate to preserve Category/Challenge
- Merge tests into test_datamodule.py

Reduces 964 lines to 128 lines (87% reduction)

Co-authored-by: Cursor <cursoragent@cursor.com>
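The point of switching to a collate function that keeps auxiliaries is that per-prompt metadata such as Category and Challenge travels with each prompt instead of being dropped. The sketch below shows that idea only. The function name `collate_prompts_with_aux` and the `text` prompt key are hypothetical; the actual signature and behavior of `prompt_with_auxiliaries_collate` in pruna are not shown in this PR and may differ.

```python
from typing import Any


def collate_prompts_with_aux(
    batch: list[dict[str, Any]], prompt_key: str = "text"
) -> tuple[list[str], list[dict[str, Any]]]:
    """Split a batch of rows into prompts and per-row auxiliary metadata.

    Hypothetical helper: ``prompt_key`` is an assumed column name.
    """
    prompts = [row[prompt_key] for row in batch]
    # Everything that is not the prompt is kept as auxiliary metadata.
    auxiliaries = [
        {key: value for key, value in row.items() if key != prompt_key}
        for row in batch
    ]
    return prompts, auxiliaries


batch = [
    {"text": "a red cube on a table", "Category": "Abstract", "Challenge": "Basic"},
    {"text": "a dog wearing a hat", "Category": "Animals", "Challenge": "Simple Detail"},
]
prompts, auxiliaries = collate_prompts_with_aux(batch)
```

With this shape, a metric or a report can group results by `auxiliaries[i]["Category"]` without re-reading the dataset.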

Document all dataclass fields per Numpydoc PR01 with summary on new line per GL01. Co-authored-by: Cursor <cursoragent@cursor.com>

- Add list_benchmarks() to filter benchmarks by task type
- Add get_benchmark_info() to retrieve benchmark metadata
- Add COCO, ImageNet, WikiText to benchmark_info registry

Co-authored-by: Cursor <cursoragent@cursor.com>
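A lightweight metadata registry of this kind could look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: the field names follow the commit messages (metrics, description, subsets, task_type), while the exact dataclass layout, the `text_to_image` task type, and the PartiPrompts metric are placeholders. The `image_edit` task type and `accuracy` metric for GEditBench come from the commit message that registers it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkInfo:
    """Metadata for one benchmark (layout is illustrative)."""

    name: str
    description: str
    task_type: str
    metrics: tuple[str, ...] = ()
    subsets: tuple[str, ...] = ()


# Hypothetical registry contents, for illustration only.
benchmark_info = {
    "PartiPrompts": BenchmarkInfo(
        name="PartiPrompts",
        description="Text-to-image prompts",
        task_type="text_to_image",  # assumed task type name
        metrics=("clip_score",),  # placeholder metric
    ),
    "GEditBench": BenchmarkInfo(
        name="GEditBench",
        description="Image editing benchmark",
        task_type="image_edit",
        metrics=("accuracy",),
        subsets=("background_change", "color_alter"),  # truncated for brevity
    ),
}


def list_benchmarks(task_type: Optional[str] = None) -> list[str]:
    """Return benchmark names, optionally restricted to one task type."""
    return sorted(
        name
        for name, info in benchmark_info.items()
        if task_type is None or info.task_type == task_type
    )


def get_benchmark_info(name: str) -> BenchmarkInfo:
    """Look up metadata, raising a descriptive error for unknown names."""
    if name not in benchmark_info:
        raise KeyError(f"Unknown benchmark {name!r}; available: {list_benchmarks()}")
    return benchmark_info[name]
```

Keeping the registry as a plain dictionary of frozen dataclasses is one way to get the metadata lookup without the class hierarchy removed earlier in the PR.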

Update benchmark metrics to match registered names:
- clip -> clip_score
- clip_iqa -> clipiqa
- Remove unimplemented top5_accuracy

Co-authored-by: Cursor <cursoragent@cursor.com>

Closes #510
- Add setup_imgedit_dataset in datasets/prompt.py
- Support subset filter (replace, add, remove, adjust, extract, style, background, compose)
- Fetch instructions and judge prompts from GitHub (PKU-YuanGroup/ImgEdit)
- Register ImgEdit in base_datasets
- Add BenchmarkInfo entry with accuracy metric, task_type image_edit
- Add test for loading with subset filter

Co-authored-by: Cursor <cursoragent@cursor.com>

Closes #511
- Add setup_gedit_dataset in datasets/prompt.py
- Support subset filter (11 task types including background_change, color_alter, etc.)
- Fetch data from HuggingFace (stepfun-ai/GEdit-Bench), filter English only
- Register GEditBench in base_datasets
- Add BenchmarkInfo entry with accuracy metric, task_type image_edit
- Add test for loading with subset filter

Co-authored-by: Cursor <cursoragent@cursor.com>
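The English-only and subset filtering described here can be sketched on plain dictionaries, without downloading the dataset. The 11 subset names are taken from the PR summary. The row keys `instruction_language` and `task_type`, the `"en"` language code, and the helper name `filter_gedit_rows` are assumptions made for illustration and are not confirmed by this PR.

```python
from typing import Iterable, Optional

# The 11 task types listed in the PR summary.
GEDIT_SUBSETS = (
    "background_change", "color_alter", "material_alter", "motion_change",
    "ps_human", "style_change", "subject_add", "subject_remove",
    "subject_replace", "text_change", "tone_transfer",
)


def filter_gedit_rows(rows: Iterable[dict], subset: Optional[str] = None) -> list[dict]:
    """Keep English rows, optionally restricted to one task type.

    Hypothetical helper: the ``instruction_language`` and ``task_type``
    keys are assumed field names.
    """
    if subset is not None and subset not in GEDIT_SUBSETS:
        raise ValueError(f"Unknown subset {subset!r}; expected one of {GEDIT_SUBSETS}")
    return [
        row
        for row in rows
        if row["instruction_language"] == "en"
        and (subset is None or row["task_type"] == subset)
    ]


rows = [
    {"task_type": "background_change", "instruction_language": "en",
     "instruction": "Make the sky red"},
    {"task_type": "background_change", "instruction_language": "cn",
     "instruction": "(non-English instruction)"},
    {"task_type": "text_change", "instruction_language": "en",
     "instruction": "Change the sign to OPEN"},
]
english_background = filter_gedit_rows(rows, subset="background_change")
```

Validating the subset name up front means a typo fails with a clear message instead of silently producing an empty dataset.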

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.

tests/data/test_datamodule.py

src/pruna/data/datasets/prompt.py

… linting
- Rename subset parameter to category in setup_gedit_dataset
- Add empty dataset guard before ds.select([0])
- Fix line too long (E501) and trailing newline (W391) issues
- Update tests to use category parameter

Co-authored-by: Cursor <cursoragent@cursor.com>

- Rename subset to category in setup_imgedit_dataset for API consistency
- Add empty dataset guard to setup_imgedit_dataset
- Add empty dataset guard to setup_parti_prompts_dataset

Co-authored-by: Cursor <cursoragent@cursor.com>
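The empty dataset guard matters because selecting index 0 from a dataset with no rows left after filtering would fail with an unhelpful index error. A minimal sketch of such a guard follows. `ListDataset` is a tiny stand-in that exposes only `__len__` and `select`, and `select_first_or_raise` is a hypothetical helper; whether the actual guard in this PR raises, warns, or returns early is not stated, so raising `ValueError` is an assumption.

```python
from typing import Iterable, Optional


class ListDataset:
    """Tiny stand-in for a dataset exposing only ``__len__`` and ``select``."""

    def __init__(self, rows: Iterable[dict]):
        self.rows = list(rows)

    def __len__(self) -> int:
        return len(self.rows)

    def select(self, indices: Iterable[int]) -> "ListDataset":
        return ListDataset(self.rows[i] for i in indices)


def select_first_or_raise(ds, category: Optional[str] = None):
    """Return a one-row dataset, failing clearly if filtering left nothing."""
    if len(ds) == 0:
        # Assumed behavior: raise with context instead of an IndexError later.
        raise ValueError(f"No samples left after filtering (category={category!r}).")
    return ds.select([0])


filtered = ListDataset([{"instruction": "Make the sky red"}, {"instruction": "Add a cat"}])
first = select_first_or_raise(filtered, category="background_change")
```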

Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-02-13T00:13:00Z

This PR has been inactive for 10 days and is now marked as stale.

davidberenstein1957 and others added 7 commits January 22, 2026 10:58

fix: add Numpydoc parameter docs for BenchmarkInfo

975adb3


fix: use correct metric names from MetricRegistry

56f2167


cursor bot reviewed Jan 31, 2026

tests/data/test_datamodule.py (outdated, resolved)

tests/data/test_datamodule.py (outdated, resolved)

src/pruna/data/datasets/prompt.py (resolved)

davidberenstein1957 and others added 3 commits January 31, 2026 17:21

fix: shorten ImgEdit description to fix line length linting

bb10cdf

Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions bot added the stale label Feb 13, 2026

davidberenstein1957 wants to merge 10 commits into main from feat/add-geditbench-benchmark
