grain dataset slow with huggingface data source

Hello,
I would like to use Hugging Face datasets with Grain, but they seem to be incredibly slow: much slower than simply loading all the data into memory as a NumPy array and passing that as the data source.

I prepared a gist: https://colab.research.google.com/gist/aurelio-amerio/1d9454132e6de94123dd6691f764c8db/hf_grain_dataset.ipynb

Why is a Grain dataset obtained from a Hugging Face dataset so much slower? Could it be related to the speed differences described here: https://huggingface.co/docs/datasets/v4.2.0/about_mapstyle_vs_iterable#speed-differences

I don't know whether I'm doing something wrong; any help would be greatly appreciated. Thank you!
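
To help narrow this down, here is a minimal sketch of how the per-record access cost of a source could be timed on its own, outside of any Grain pipeline. It assumes only that the source supports `len()` and integer indexing, which is what a random-access data source needs. The helper names `time_access` and `compare_access_patterns` are hypothetical and not part of Grain or `datasets`, and the in-memory list is just a stand-in for a real Hugging Face dataset or NumPy array.

```python
import random
import time


def time_access(source, indices):
    """Return the mean seconds per source[i] lookup over the given indices."""
    start = time.perf_counter()
    for i in indices:
        source[i]  # only the lookup cost matters here, the value is discarded
    elapsed = time.perf_counter() - start
    return elapsed / len(indices)


def compare_access_patterns(source, num_samples=1000, seed=0):
    """Time sequential vs. shuffled index access on any indexable source.

    `source` is assumed to implement __len__ and __getitem__(int).
    """
    n = len(source)
    num_samples = min(num_samples, n)
    sequential = list(range(num_samples))
    # Sample distinct indices from the whole source to mimic a shuffled epoch.
    shuffled = random.Random(seed).sample(range(n), num_samples)
    return {
        "sequential": time_access(source, sequential),
        "random": time_access(source, shuffled),
    }


if __name__ == "__main__":
    # Stand-in source; swap in the Hugging Face dataset or the NumPy array
    # to compare their per-record lookup times directly.
    in_memory = [[float(i)] * 8 for i in range(10_000)]
    print(compare_access_patterns(in_memory))
```

If the Hugging Face source shows a much higher per-record time than the NumPy array in this kind of test, the slowdown would likely come from the record lookup itself rather than from the rest of the pipeline.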






