Update pyarrow dependency version constraint #239

Open
gioelemo wants to merge 1 commit into IDEALLab:main from gioelemo:patch-1

@gioelemo

Description

Summary

  • Remove the pyarrow < 20.0.0 upper bound from dependencies

Motivation

The current pin (pyarrow >= 15.0.0, < 20.0.0) was added because HuggingFace datasets hadn't migrated to pyarrow 20's breaking changes yet. That migration is now complete:

  • datasets >= 3.5.1 (April 2025): first version to support pyarrow 20+
  • datasets >= 4.1.0 (September 2025): pyarrow >= 21 is now required

With the current cap in place, pip install engibench resolves to datasets 4.0.0 + pyarrow 19.x. However, datasets 4.0.0 has a known bug (huggingface/datasets#8085) where using torch-formatted datasets crashes with:

ImportError: cannot import name 'VideoReader' from 'torchvision.io'

This is fixed in datasets==4.8.4, but upgrading datasets pulls pyarrow >= 23, which conflicts with EngiBench's < 20 cap. Users are forced to either live with the crash or install incompatible versions and ignore pip warnings.

In practice, this means a fresh pip install engibench produces an environment with the new PyTorch 2.11 + torchvision 0.26.0 where any training script using torch-formatted datasets will fail out of the box.
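To check which combination of packages an environment actually ended up with, a small diagnostic along these lines may help. It is only a sketch: the helper name `installed_versions` is hypothetical, and the package list is an assumption based on the packages named in this description.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a mapping of package name to installed version (None if absent)."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Packages mentioned in this PR description; adjust as needed.
    names = ["engibench", "datasets", "pyarrow", "torch", "torchvision"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver if ver is not None else 'not installed'}")
```

If the output shows datasets 4.0.0 alongside pyarrow 19.x, the environment matches the resolution described above.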

References:

Change

- "pyarrow >= 15.0.0, < 20.0.0", # HF datasets not migrated to pyarrow 20.0.0 yet
+ "pyarrow >= 15.0.0",
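The effect of dropping the upper bound can be illustrated with a simplified version check. This is a sketch only: it compares plain numeric `X.Y.Z` versions with tuples rather than implementing pip's full version-specifier rules, the helper names `parse` and `satisfies` are hypothetical, and the candidate versions are illustrative.

```python
def parse(version):
    """Turn a plain numeric version such as '19.0.1' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, lower, upper=None):
    """Check lower <= version, and version < upper when an upper bound is given."""
    v = parse(version)
    if v < parse(lower):
        return False
    if upper is not None and v >= parse(upper):
        return False
    return True


# Old constraint: pyarrow >= 15.0.0, < 20.0.0
OLD = {"lower": "15.0.0", "upper": "20.0.0"}
# New constraint: pyarrow >= 15.0.0
NEW = {"lower": "15.0.0", "upper": None}

if __name__ == "__main__":
    for candidate in ("19.0.1", "21.0.0", "23.0.0"):
        print(candidate, "old:", satisfies(candidate, **OLD), "new:", satisfies(candidate, **NEW))
```

Under the old bounds, any pyarrow release at or above 20.0.0 is rejected, so a datasets release that requires pyarrow >= 21 cannot be installed alongside it. The new bounds accept those releases.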

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • I have run the pre-commit checks with pre-commit run --all-files
  • I have run ruff check . and ruff format
  • I have run mypy .
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Reviewer Checklist:

  • The content of this PR brings value to the community. It is not too specific to a particular use case.
  • The tests and checks pass (linting, formatting, type checking). For a new problem, double check the github actions workflow to ensure the problem is being tested.
  • The documentation is updated.
  • The code is understandable and commented. No large code blocks are left unexplained, no huge file. Can I read and understand the code easily?
  • There is no merge conflict.
  • The changes are not breaking the existing results (datasets, training curves, etc.). If they do, is there a good reason for it? And is the associated problem version bumped?
  • For a new problem, has the dataset been generated with our slurm script so we can re-generate it if needed? (This also ensures that the problem is running on the HPC.)
  • For bugfixes, it is a robust fix and not a hacky workaround.

Removed upper limit on pyarrow version in dependencies.