feat: add GEditBench benchmark with task type subsets #518

Open
davidberenstein1957 wants to merge 10 commits into main from feat/add-geditbench-benchmark

@davidberenstein1957 davidberenstein1957 commented Jan 31, 2026

Closes #511

Summary

  • Add GEditBench benchmark for image editing evaluation with 11 task type subsets
  • Fetch data from HuggingFace (stepfun-ai/GEdit-Bench), filter to English only
  • Support subset filtering: background_change, color_alter, material_alter, motion_change, ps_human, style_change, subject_add, subject_remove, subject_replace, text_change, tone_transfer
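The filtering described in the summary can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's implementation: the helper `filter_gedit_records` is hypothetical, the record keys `task_type` and `instruction_language` are assumptions about the dataset schema, and plain dictionaries stand in for the HuggingFace dataset.

```python
from typing import Dict, Iterable, List, Optional

# The 11 task types listed in the summary.
GEDIT_TASK_TYPES = (
    "background_change", "color_alter", "material_alter", "motion_change",
    "ps_human", "style_change", "subject_add", "subject_remove",
    "subject_replace", "text_change", "tone_transfer",
)


def filter_gedit_records(
    records: Iterable[Dict[str, str]], subset: Optional[str] = None
) -> List[Dict[str, str]]:
    """Keep English records, optionally restricted to a single task type.

    Hypothetical helper: the key names below are assumed, not taken from the PR.
    """
    if subset is not None and subset not in GEDIT_TASK_TYPES:
        raise ValueError(f"Unknown subset {subset!r}; expected one of {GEDIT_TASK_TYPES}")

    kept = []
    for record in records:
        # Drop every record whose instruction is not in English.
        if record.get("instruction_language") != "en":
            continue
        # Apply the optional task-type filter.
        if subset is not None and record.get("task_type") != subset:
            continue
        kept.append(record)
    return kept


# Toy records standing in for rows of the real dataset.
sample = [
    {"task_type": "background_change", "instruction_language": "en", "instruction": "Make the sky red"},
    {"task_type": "background_change", "instruction_language": "cn", "instruction": "..."},
    {"task_type": "color_alter", "instruction_language": "en", "instruction": "Paint the car blue"},
]

english_only = filter_gedit_records(sample)
background_only = filter_gedit_records(sample, subset="background_change")
```

Validating the subset name up front means a typo fails loudly instead of silently producing an empty dataset.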

Usage

from pruna.data import PrunaDataModule

# Load all task types
dm = PrunaDataModule.from_string("GEditBench")

# Load specific task type
dm = PrunaDataModule.from_string("GEditBench", subset="background_change")

Test plan

  • PrunaDataModule.from_string("GEditBench") works
  • Subset filter works for all 11 task types
  • Auxiliaries include subset field
  • Docstring tests pass

davidberenstein1957 and others added 7 commits January 22, 2026 10:58
…mpts benchmark

- Introduced `from_benchmark` method in `PrunaDataModule` to create instances from benchmark classes.
- Added `Benchmark`, `BenchmarkEntry`, and `BenchmarkRegistry` classes for managing benchmarks.
- Implemented `PartiPrompts` benchmark for text-to-image generation with various categories and challenges.
- Created utility function `benchmark_to_datasets` to convert benchmarks into datasets compatible with `PrunaDataModule`.
- Added integration tests for benchmark functionality and data module interactions.
…filtering

- Remove heavy benchmark abstraction (Benchmark class, registry, adapter, 24 subclasses)
- Extend setup_parti_prompts_dataset with category and num_samples params
- Add BenchmarkInfo dataclass for metadata (metrics, description, subsets)
- Switch PartiPrompts to prompt_with_auxiliaries_collate to preserve Category/Challenge
- Merge tests into test_datamodule.py

Reduces 964 lines to 128 lines (87% reduction)

Co-authored-by: Cursor <cursoragent@cursor.com>
Document all dataclass fields per Numpydoc PR01 with summary on new line per GL01.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add list_benchmarks() to filter benchmarks by task type
- Add get_benchmark_info() to retrieve benchmark metadata
- Add COCO, ImageNet, WikiText to benchmark_info registry

Co-authored-by: Cursor <cursoragent@cursor.com>
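
A lightweight metadata registry of the kind these commits describe might look roughly like the sketch below. Only the names `BenchmarkInfo`, `list_benchmarks`, and `get_benchmark_info` come from the commit messages; the field layout, the signatures, and every registry value (task types, metrics, descriptions) are illustrative placeholders, not the PR's actual data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class BenchmarkInfo:
    """Metadata for one benchmark (field layout is an assumption)."""

    name: str
    description: str
    task_type: str
    metrics: List[str] = field(default_factory=list)
    subsets: List[str] = field(default_factory=list)


# Placeholder entries: the values are illustrative, not copied from the PR.
benchmark_info: Dict[str, BenchmarkInfo] = {
    "PartiPrompts": BenchmarkInfo(
        name="PartiPrompts",
        description="Text-to-image prompts (placeholder description).",
        task_type="text_to_image",
        metrics=["clip_score"],
    ),
    "WikiText": BenchmarkInfo(
        name="WikiText",
        description="Language modelling corpus (placeholder description).",
        task_type="text_generation",
    ),
}


def list_benchmarks(task_type: Optional[str] = None) -> List[str]:
    """Return benchmark names, optionally restricted to one task type."""
    return sorted(
        name
        for name, info in benchmark_info.items()
        if task_type is None or info.task_type == task_type
    )


def get_benchmark_info(name: str) -> BenchmarkInfo:
    """Look up a benchmark's metadata, failing with a helpful message."""
    try:
        return benchmark_info[name]
    except KeyError:
        raise KeyError(f"Unknown benchmark {name!r}; available: {list_benchmarks()}") from None
```

Keeping the metadata in a plain dictionary of frozen dataclasses matches the earlier commit's stated goal of dropping the heavier class-and-registry abstraction.
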
Update benchmark metrics to match registered names:
- clip -> clip_score
- clip_iqa -> clipiqa
- Remove unimplemented top5_accuracy

Co-authored-by: Cursor <cursoragent@cursor.com>
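
The rename in this commit can be summarized as a small mapping. The commit itself appears to edit the metric names in place; the hypothetical `normalize_metrics` helper below only illustrates the old-to-new correspondence and is not part of the PR.

```python
from typing import Iterable, List

# Old name -> registered name, as listed in the commit message.
RENAMED_METRICS = {"clip": "clip_score", "clip_iqa": "clipiqa"}
# Metrics removed because they are not implemented.
UNIMPLEMENTED_METRICS = {"top5_accuracy"}


def normalize_metrics(names: Iterable[str]) -> List[str]:
    """Map old metric names to registered ones and drop unimplemented metrics."""
    normalized = []
    for name in names:
        if name in UNIMPLEMENTED_METRICS:
            continue
        normalized.append(RENAMED_METRICS.get(name, name))
    return normalized


updated = normalize_metrics(["clip", "clip_iqa", "top5_accuracy", "accuracy"])
```
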
Closes #510

- Add setup_imgedit_dataset in datasets/prompt.py
- Support subset filter (replace, add, remove, adjust, extract, style, background, compose)
- Fetch instructions and judge prompts from GitHub (PKU-YuanGroup/ImgEdit)
- Register ImgEdit in base_datasets
- Add BenchmarkInfo entry with accuracy metric, task_type image_edit
- Add test for loading with subset filter

Co-authored-by: Cursor <cursoragent@cursor.com>
Closes #511

- Add setup_gedit_dataset in datasets/prompt.py
- Support subset filter (11 task types including background_change, color_alter, etc.)
- Fetch data from HuggingFace (stepfun-ai/GEdit-Bench), filter English only
- Register GEditBench in base_datasets
- Add BenchmarkInfo entry with accuracy metric, task_type image_edit
- Add test for loading with subset filter

Co-authored-by: Cursor <cursoragent@cursor.com>

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

davidberenstein1957 and others added 3 commits January 31, 2026 17:21
… linting

- Rename subset parameter to category in setup_gedit_dataset
- Add empty dataset guard before ds.select([0])
- Fix line too long (E501) and trailing newline (W391) issues
- Update tests to use category parameter

Co-authored-by: Cursor <cursoragent@cursor.com>
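
The empty dataset guard mentioned in this commit might look roughly like the sketch below. `TinyDataset` is a hypothetical stand-in that mimics only the `filter`, `select`, and `len` behaviour relied on here, so the example runs without external dependencies; `first_sample_for_category` and the `task_type` key are likewise assumptions, not code from the PR.

```python
from typing import Callable, Dict, Iterable, List, Optional


class TinyDataset:
    """Hypothetical stand-in for a dataset object with filter/select/len."""

    def __init__(self, rows: Iterable[Dict[str, str]]):
        self._rows: List[Dict[str, str]] = list(rows)

    def __len__(self) -> int:
        return len(self._rows)

    def filter(self, predicate: Callable[[Dict[str, str]], bool]) -> "TinyDataset":
        return TinyDataset(row for row in self._rows if predicate(row))

    def select(self, indices: Iterable[int]) -> "TinyDataset":
        return TinyDataset(self._rows[i] for i in indices)


def first_sample_for_category(ds: TinyDataset, category: Optional[str] = None) -> TinyDataset:
    """Filter by category, then guard against an empty result before selecting."""
    if category is not None:
        ds = ds.filter(lambda row: row["task_type"] == category)
    # Guard: selecting index 0 from an empty dataset would otherwise fail
    # with an unhelpful index error.
    if len(ds) == 0:
        raise ValueError(f"No samples found for category {category!r}")
    return ds.select([0])


rows = TinyDataset([{"task_type": "color_alter"}, {"task_type": "text_change"}])
first = first_sample_for_category(rows, category="text_change")
```
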
- Rename subset to category in setup_imgedit_dataset for API consistency
- Add empty dataset guard to setup_imgedit_dataset
- Add empty dataset guard to setup_parti_prompts_dataset

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Feb 13, 2026