
Datasets metadata#841

Open
ThomSerg wants to merge 6 commits into master from datasets_metadata

@ThomSerg
Collaborator

@ThomSerg ThomSerg commented Jan 30, 2026

Follow-up of #840, standardising instance-level metadata extraction and storage as .meta.json sidecar files.
(still also includes the changes from #840)

Instead of extracting metadata on the fly (while iterating over the dataset), this PR proposes to extract it ahead of time, on initial download, since some metadata can take a while to compute. New datasets only have to implement:

def collect_instance_metadata(self, file: str) -> dict:
    # Do your one-time processing of the file here and return the metadata as a dict.
    ...

For each instance, the metadata is stored in a separate sidecar file named <instance_name_without_extensions>.meta.json.

For example, for the jsplib instance abz5, abz5.meta.json contains:

{
  "name": "abz5",
  "jobs": 10,
  "machines": 10,
  "optimum": 1234,
  "bounds": {
    "upper": 1234,
    "lower": 1234
  }
}

This data was already being collected before, but from scratch on every access of the dataset. Now it is stored in a reusable metadata file.
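
A minimal sketch of the collect-once, reuse-afterwards flow described above. This is not the PR's actual implementation: the helper names `sidecar_path` and `load_or_collect_metadata` are hypothetical, and the sketch assumes that the instance name is everything before the first dot of the file name (so instance names that themselves contain dots would need different handling).

```python
import json
from pathlib import Path
from typing import Callable


def sidecar_path(instance_file: str) -> Path:
    """Return <instance_name_without_extensions>.meta.json next to the instance file."""
    path = Path(instance_file)
    # Assumption: strip every extension, e.g. "abz5.txt.gz" -> "abz5".
    stem = path.name.split(".")[0]
    return path.with_name(f"{stem}.meta.json")


def load_or_collect_metadata(instance_file: str, collect: Callable[[str], dict]) -> dict:
    """Reuse the sidecar if it exists; otherwise collect once and store the result."""
    meta_file = sidecar_path(instance_file)
    if meta_file.exists():
        return json.loads(meta_file.read_text(encoding="utf-8"))
    # `collect` stands in for a dataset's collect_instance_metadata method.
    metadata = collect(instance_file)
    meta_file.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata
```

With this shape, the potentially slow `collect` callback runs only on the first access; later accesses read the JSON sidecar instead.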

@ThomSerg ThomSerg added the "blocked" label (Pull request blocked by another pull request/issue.) on Jan 30, 2026
@ThomSerg
Collaborator Author

Maybe we should already start thinking about a better structure for this metadata, so that in the future we can more easily connect it to standardised dataset metadata formats such as MLCommons' Croissant. These standards are only authoritative at the dataset level, not for individual instances, but we should still format the instance metadata in a future-proof manner.

E.g.

# <instance>.meta.json
{
  "type": "instance-metadata",
  "dataset": "jsplib",
  "schema": "croissant.json#jsplib_instances",
  "instanceId": "abz5",

  "features": {
    "jobs": 10,
    "machines": 10,
    "optimum": 1234,
    "bounds": { "lower": 1234, "upper": 1234 }
  }
}
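
As a rough sketch, the flat sidecar content from the PR description could be wrapped into this envelope as follows. The function name `to_instance_metadata` is hypothetical, and the sketch assumes that every key except "name" belongs under "features".

```python
def to_instance_metadata(flat: dict, dataset: str, schema: str) -> dict:
    """Wrap flat instance metadata into the proposed envelope structure."""
    # Assumption: "name" becomes the instance id, everything else is a feature.
    features = {key: value for key, value in flat.items() if key != "name"}
    return {
        "type": "instance-metadata",
        "dataset": dataset,
        "schema": schema,
        "instanceId": flat["name"],
        "features": features,
    }
```

For the abz5 example, calling `to_instance_metadata(flat, "jsplib", "croissant.json#jsplib_instances")` on the original flat dict would produce the structure shown above.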

And then a snippet of a potential Croissant description:

{
  "@type": "cr:RecordSet",
  "name": "jsplib_instances",
  "field": [
    { "@id": "jobs", "dataType": "Integer" },
    { "@id": "machines", "dataType": "Integer" },
    { "@id": "optimum", "dataType": "Integer" },
    {
      "@id": "bounds.lower",
      "dataType": "Integer"
    },
    {
      "@id": "bounds.upper",
      "dataType": "Integer"
    }
  ]
}
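
To illustrate how the dotted field ids (such as "bounds.lower") could line up with the nested "features" object, here is a small sketch in plain Python. It does not use any Croissant library, and the helper names `flatten_features` and `missing_fields` are hypothetical; it only assumes the record set is available as a dict shaped like the snippet above.

```python
def flatten_features(features: dict, prefix: str = "") -> dict:
    """Flatten nested features into dotted keys, e.g. {"bounds": {"lower": 1}} -> {"bounds.lower": 1}."""
    flat = {}
    for key, value in features.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_features(value, dotted))
        else:
            flat[dotted] = value
    return flat


def missing_fields(features: dict, record_set: dict) -> set:
    """Return the field ids declared in the record set that the instance features lack."""
    declared = {field["@id"] for field in record_set["field"]}
    return declared - set(flatten_features(features))
```

A check like `missing_fields` could run at download time, so that a sidecar file that drifts from the dataset-level description is caught early.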

This would allow us, in the future, to provide access to datasets whose metadata follows well-accepted standards, promoting, as they describe it, "discoverability, portability, reproducibility, and responsible AI (RAI)".

@tias
Collaborator

tias commented Feb 22, 2026

Yes, it can indeed be good to do this at download time; following a standard is also fine if it is not too much extra work or code.

Labels

blocked (Pull request blocked by another pull request/issue.), needs discussion
