Skip to content

Update dependency datasets to v5#10

Open
renovate[bot] wants to merge 1 commit into
mainfrom
renovate/datasets-5.x
Open

Update dependency datasets to v5#10
renovate[bot] wants to merge 1 commit into
mainfrom
renovate/datasets-5.x

Conversation

@renovate

@renovate renovate Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

This PR contains the following updates:

Package Change Age Confidence
datasets ==4.8.5==5.0.0 age confidence

Release Notes

huggingface/datasets (datasets)

v5.0.0

Compare Source

Datasets Features
Agent traces
  • Parse Agent traces messages for SFT using teich by @​lhoestq in #​8232

    • Agent traces from claude_code/pi/codex and others can now be loaded with load_dataset
    • Using the teich library (new optional dependency), traces are parsed to messages to enable training on traces using e.g. trl
    • Load the data:
    >>> from datasets import load_dataset
    >>> ds = load_dataset("lhoestq/agent-traces-example", split="train")
    >>> ds[0]["messages"]
    [{'role': 'user', 'content': 'Download a random dataset from Hugging Face, use DuckDB to inspect it, and come back with a short report about it. Be concise and include: dataset name, what files/format you found, row count or rough size if you can determine it,...'
     ...]
    • Train on agent traces:
    trl sft --dataset-name lhoestq/agent-traces-example ...
Next-level shuffling in streaming mode
  • Use multiple input shards for shuffle buffer by @​lhoestq in #​8194

    ds = load_dataset(..., streaming=True)
    ds = ds.shuffle(seed=42)
    # or configure local buffer shuffling manually, default is:
    ds = ds.shuffle(seed=42, buffer_size=1000, max_buffer_input_shards=10)

    before👎: image

    after✨: image

    toy example comparison

    from datasets import IterableDataset
    
    ds = IterableDataset.from_dict({"i": range(123_456_789)}, num_shards=1024)
    ds = ds.shuffle(seed=42)
    
    print("Cold start ids:")
    print(list(ds.take(10)["i"]))
    print("Nominal regime ids:")
    print(list(ds.skip(10_000).take(10)["i"]))

    before👎:

    Cold start ids:
    [6148853, 6149537, 6149418, 6149202, 6149197, 6149622, 6148849, 6149461, 6148965, 6148858]
    Nominal regime ids:
    [6149537, 6149418, 6149202, 6149197, 6149622, 6148849, 6149461, 6148965, 6148858, 6149290]
    

    after✨:

    Cold start ids:
    [7836668, 9283505, 95847927, 482299, 9283471, 482341, 112003312, 59920157, 43764666, 95847871]
    Nominal regime ids:
    [9283505, 95847927, 482299, 9283471, 482341, 112003312, 59920157, 43764666, 95847871, 16758448]
    

    Note: ds.state_dict() and ds.load_state_dict() are still supported for this improved shuffling :) enabling dataset checkpointing

    Note 2: it uses threads to fetch the first examples in parallel from the input shards

    Note 3: This is a BREAKING CHANGE: the default shuffling mechanism now uses multiple input shards. You can get the old mechanism by passing max_buffer_input_shards=1 to IterableDataset.shuffle()

New batching features for robotics datasets
  • Add batch(by_column=...) by @​lhoestq in #​8172

    from datasets import Dataset
    
    ds = Dataset.from_dict({"episode": [0] * 10 + [1] * 10, "frame": list(range(10)) * 2})
    # ds = ds.to_iterable_dataset()
    ds = ds.batch(by_column="episode")
    for x in ds:
        print(x)
    # {'episode': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'frame': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
    # {'episode': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'frame': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
New supported formats
Other improvements and bug fixes
New Contributors

Full Changelog: huggingface/datasets@4.8.5...5.0.0


Configuration

📅 Schedule: (UTC)

  • Branch creation
    • At any time (no schedule defined)
  • Automerge
    • At any time (no schedule defined)

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants