Datasets metadata by ThomSerg · Pull Request #841 · CPMpy/cpmpy

ThomSerg · 2026-01-30T12:21:58Z

Follow-up of #840, standardising instance-level metadata extraction and storage as .meta.json sidecar files.
(still also includes the changes from #840)

Instead of extracting metadata on the spot (during iteration over the dataset), this PR proposes to extract it ahead of time, on initial download, since the extraction can take some time depending on which metadata you want to add. New datasets only have to implement:

def collect_instance_metadata(self, file: str) -> dict:
    # ... do your one-time processing of the file here ...
    ...
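
As a rough illustration of what such a hook could look like, here is a minimal sketch for a job-shop style dataset. The class name `ToyJobShopDataset` and the file layout it parses (optional `#` comment lines followed by a `<jobs> <machines>` header line) are assumptions made for this example; they are not the actual CPMpy dataset class, nor a confirmed description of the jsplib files.

```python
import os
import tempfile


class ToyJobShopDataset:
    """Hypothetical dataset class; only the metadata hook is shown."""

    def collect_instance_metadata(self, file: str) -> dict:
        # Assumed layout: '#' comment lines, then a "<jobs> <machines>" header.
        with open(file, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue  # skip blank lines and comments
                jobs, machines = (int(tok) for tok in line.split()[:2])
                return {"jobs": jobs, "machines": machines}
        raise ValueError(f"no header line found in {file}")


# Usage: write a tiny made-up instance and extract its metadata once.
with tempfile.TemporaryDirectory() as tmp:
    toy_file = os.path.join(tmp, "toy")
    with open(toy_file, "w", encoding="utf-8") as f:
        f.write("# toy instance\n2 3\n0 5 1 4 2 7\n1 3 0 6 2 2\n")
    demo_metadata = ToyJobShopDataset().collect_instance_metadata(toy_file)

print(demo_metadata)  # {'jobs': 2, 'machines': 3}
```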

For each instance, the metadata is stored in a separate sidecar file named <instance_name_without_extensions>.meta.json.
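
The naming rule can be sketched as follows. The helper name `sidecar_path` is hypothetical, and the sketch assumes that instance names themselves contain no dots, so everything after the first dot counts as an extension; the PR's actual implementation may handle this differently.

```python
from pathlib import Path


def sidecar_path(instance_file: str) -> Path:
    """Return the <instance_name_without_extensions>.meta.json path next to the instance."""
    path = Path(instance_file)
    # Strip every extension ("abz5.txt.gz" -> "abz5"); names without an
    # extension ("abz5") are kept as they are.
    stem = path.name.split(".", 1)[0]
    return path.with_name(f"{stem}.meta.json")


print(sidecar_path("jsplib/abz5.txt.gz").name)  # abz5.meta.json
print(sidecar_path("jsplib/abz5").name)         # abz5.meta.json
```

Handling files without any extension is included deliberately, since instance files are not guaranteed to have one.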

For example, for jsplib's instance abz5, abz5.meta.json contains:

{
  "name": "abz5",
  "jobs": 10,
  "machines": 10,
  "optimum": 1234,
  "bounds": {
    "upper": 1234,
    "lower": 1234
  }
}

This data was already being collected before, but from scratch on each access of the dataset. Now it is stored in a reusable metadata file.
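
To make the reuse concrete, here is a minimal sketch of reading a sidecar when it exists and computing it only once otherwise. The function `load_or_collect` and the collector `fake_collect` are hypothetical names for this example. The PR itself extracts metadata at download time; the lazy fallback below is only there to keep the sketch self-contained.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_collect(instance_file: str, collect: Callable[[str], dict]) -> dict:
    """Read the sidecar if present; otherwise run `collect` once and store its result."""
    path = Path(instance_file)
    # Sidecar sits next to the instance: <name_without_extensions>.meta.json
    sidecar = path.with_name(path.name.split(".", 1)[0] + ".meta.json")
    if sidecar.exists():
        return json.loads(sidecar.read_text(encoding="utf-8"))
    metadata = collect(instance_file)
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata


# Usage: the (made-up) collector runs only on the first access.
calls = []


def fake_collect(file: str) -> dict:
    calls.append(file)
    return {"name": "abz5", "jobs": 10, "machines": 10}


with tempfile.TemporaryDirectory() as tmp:
    instance = str(Path(tmp) / "abz5.txt")
    Path(instance).write_text("placeholder instance contents", encoding="utf-8")
    first = load_or_collect(instance, fake_collect)   # computes and writes the sidecar
    second = load_or_collect(instance, fake_collect)  # served from abz5.meta.json
```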

This reverts commit 3f10a94.

ThomSerg · 2026-01-30T14:31:28Z

Maybe we should already start thinking about a better structure for this metadata, so that in the future we can more easily connect it to standardised dataset metadata formats such as MLCommons' Croissant. These standards are only authoritative at the dataset level, not for individual instances, but we should still format the instance metadata in a future-proof manner.

E.g.

# <instance>.meta.json
{
  "type": "instance-metadata",
  "dataset": "jsplib",
  "schema": "croissant.json#jsplib_instances",
  "instanceId": "abz5",

  "features": {
    "jobs": 10,
    "machines": 10,
    "optimum": 1234,
    "bounds": { "lower": 1234, "upper": 1234 }
  }
}
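
A minimal sketch of how the flat metadata from the first example could be converted into this structure. The function name `wrap_instance_metadata` is hypothetical, and the `schema` string simply follows the pattern of the example above rather than any confirmed convention.

```python
def wrap_instance_metadata(flat: dict, dataset: str) -> dict:
    """Move instance features under "features" and add the descriptive envelope."""
    # Everything except the instance name is treated as a feature.
    features = {key: value for key, value in flat.items() if key != "name"}
    return {
        "type": "instance-metadata",
        "dataset": dataset,
        "schema": f"croissant.json#{dataset}_instances",  # pattern assumed from the example
        "instanceId": flat["name"],
        "features": features,
    }


flat = {
    "name": "abz5",
    "jobs": 10,
    "machines": 10,
    "optimum": 1234,
    "bounds": {"upper": 1234, "lower": 1234},
}
wrapped = wrap_instance_metadata(flat, "jsplib")
print(wrapped["instanceId"], sorted(wrapped["features"]))
```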

And then a snippet of a potential Croissant description:

{
  "@type": "cr:RecordSet",
  "name": "jsplib_instances",
  "field": [
    { "@id": "jobs", "dataType": "Integer" },
    { "@id": "machines", "dataType": "Integer" },
    { "@id": "optimum", "dataType": "Integer" },
    {
      "@id": "bounds.lower",
      "dataType": "Integer"
    },
    {
      "@id": "bounds.upper",
      "dataType": "Integer"
    }
  ]
}
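
One thing such a description would enable is checking an instance's `features` against the declared fields. The sketch below follows the simplified snippet above (dotted `@id` values, a single `Integer` data type) and should not be read as an implementation of the full Croissant specification; the helper names `lookup` and `check_features` are hypothetical.

```python
def lookup(features: dict, dotted_id: str):
    """Resolve a dotted field id such as "bounds.lower" in nested features."""
    value = features
    for part in dotted_id.split("."):
        value = value[part]
    return value


def check_features(features: dict, record_set: dict) -> list:
    """Return the ids of fields that are missing or not of the declared type."""
    python_types = {"Integer": int}  # only the data type used in the snippet
    problems = []
    for field in record_set["field"]:
        expected = python_types[field["dataType"]]
        try:
            value = lookup(features, field["@id"])
        except (KeyError, TypeError):
            problems.append(field["@id"])  # field is absent from the features
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(field["@id"])
    return problems


record_set = {
    "@type": "cr:RecordSet",
    "name": "jsplib_instances",
    "field": [
        {"@id": "jobs", "dataType": "Integer"},
        {"@id": "machines", "dataType": "Integer"},
        {"@id": "optimum", "dataType": "Integer"},
        {"@id": "bounds.lower", "dataType": "Integer"},
        {"@id": "bounds.upper", "dataType": "Integer"},
    ],
}
good = {"jobs": 10, "machines": 10, "optimum": 1234,
        "bounds": {"lower": 1234, "upper": 1234}}
bad = {"jobs": "10", "machines": 10, "optimum": 1234,
       "bounds": {"lower": 1234}}

print(check_features(good, record_set))  # []
print(check_features(bad, record_set))   # ['jobs', 'bounds.upper']
```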

This would allow us in the future to provide access to datasets with metadata following well-accepted standards, promoting, as they describe, "discoverability, portability, reproducibility, and responsible AI (RAI)".

tias · 2026-02-22T21:51:35Z

yes, can be good to make it at download time indeed; and following a standard is also fine if it is not too much extra work or code

ThomSerg added 3 commits January 30, 2026 11:36

Cherry pick dataset files

0842d8d

update download docstring

864c48c

metadata using sidecars

3ca77c1

ThomSerg added the "blocked" label (Pull request blocked by another pull request/issue.) Jan 30, 2026

ThomSerg added 3 commits January 30, 2026 13:27

fix files without extension

3f10a94

Revert "fix files without extension"

6fefab8

This reverts commit 3f10a94.

fix files without extension 2

6a48da6

ThomSerg added the needs discussion label Jan 30, 2026

tias mentioned this pull request Feb 2, 2026

Datasets, parsers and benchmarks #733

Draft

ThomSerg wants to merge 6 commits into master from datasets_metadata
