Open
Labels
type:performance (Make things lean and fast)
Description
Hello,
I would like to use Hugging Face datasets with Grain, but they seem to be incredibly slow: much slower than loading all the data into memory as a NumPy array and passing that as the data source.
I prepared a gist: https://colab.research.google.com/gist/aurelio-amerio/1d9454132e6de94123dd6691f764c8db/hf_grain_dataset.ipynb
Why is a Grain dataset built from a Hugging Face dataset so much slower? Could it be related to this: https://huggingface.co/docs/datasets/v4.2.0/about_mapstyle_vs_iterable#speed-differences
I don't know whether I'm doing something wrong. Any help would be greatly appreciated. Thank you!
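To help narrow this down, here is a minimal sketch of how the per-record access cost of a source could be measured outside of Grain. It assumes only that the source supports `len(source)` and integer indexing via `source[i]`. The helper `seconds_per_record` and the `SlowSource` class are hypothetical names made up for this illustration, and `SlowSource` only simulates a source that does extra work on every lookup. It is not taken from the notebook above.

```python
import random
import time


def seconds_per_record(source, num_samples=1000, seed=0):
    """Return the average wall-clock time of one random source[i] lookup."""
    rng = random.Random(seed)
    # Draw the indices up front so that only the lookups are timed.
    indices = [rng.randrange(len(source)) for _ in range(num_samples)]
    start = time.perf_counter()
    for i in indices:
        source[i]
    return (time.perf_counter() - start) / num_samples


class SlowSource:
    """Hypothetical stand-in for a source that does per-record work on access."""

    def __init__(self, records):
        self._records = records

    def __len__(self):
        return len(self._records)

    def __getitem__(self, index):
        # Simulate per-record decoding by copying the record into a new dict.
        return {"x": list(self._records[index])}


# Plain in-memory records versus the same records behind the wrapper.
records = [[float(i)] * 64 for i in range(10_000)]
fast = seconds_per_record(records)
slow = seconds_per_record(SlowSource(records))
print(f"in-memory list: {fast * 1e6:.2f} us/record")
print(f"wrapped source: {slow * 1e6:.2f} us/record")
```

Since both a Hugging Face `Dataset` and a NumPy array support `len()` and integer indexing, the same helper could presumably be called on each of them directly. If the gap already shows up there, the slowdown would come from the per-record lookup itself rather than from Grain.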