Maybe we should already start thinking about a better structure for this metadata, so that in the future we can more easily connect it to standardised dataset metadata formats such as MLCommons' Croissant. These standards are only authoritative at the dataset level, not for individual instances, but we should still format it in a future-proof manner. E.g.
# <instance>.meta.json
{
"type": "instance-metadata",
"dataset": "jsplib",
"schema": "croissant.json#jsplib_instances",
"instanceId": "abz5",
"features": {
"jobs": 10,
"machines": 10,
"optimum": 1234,
"bounds": { "lower": 1234, "upper": 1234 }
}
}
And then a snippet of a potential Croissant description:
{
"@type": "cr:RecordSet",
"name": "jsplib_instances",
"field": [
{ "@id": "jobs", "dataType": "Integer" },
{ "@id": "machines", "dataType": "Integer" },
{ "@id": "optimum", "dataType": "Integer" },
{
"@id": "bounds.lower",
"dataType": "Integer"
},
{
"@id": "bounds.upper",
"dataType": "Integer"
}
]
}
This would allow us in the future to provide access to datasets with metadata following well-accepted standards, promoting, as they describe, "discoverability, portability, reproducibility, and responsible AI (RAI)".
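To make the link between the two snippets concrete, here is a minimal sketch of how a sidecar's nested `features` could be checked against the field list of the record set. It assumes the dotted `@id` convention from the snippet above (`bounds.lower`), works on plain dictionaries only, and does not use any Croissant tooling; `flatten`, `check_instance`, and `PYTHON_TYPES` are hypothetical names, not part of this PR.

```python
from typing import Any


def flatten(features: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys: {"bounds": {"lower": 1}} -> {"bounds.lower": 1}."""
    flat: dict[str, Any] = {}
    for key, value in features.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


# Assumption: only the "Integer" data type used in the snippet is mapped here.
PYTHON_TYPES = {"Integer": int}


def check_instance(meta: dict, record_set: dict) -> list[str]:
    """Return a list of problems; an empty list means every declared field is present and typed as declared."""
    flat = flatten(meta["features"])
    problems = []
    for field in record_set["field"]:
        name = field["@id"]
        expected = PYTHON_TYPES.get(field["dataType"])
        if name not in flat:
            problems.append(f"missing field: {name}")
        elif expected is not None and not isinstance(flat[name], expected):
            problems.append(f"wrong type for {name}: {type(flat[name]).__name__}")
    return problems


# Example data copied from the two snippets in this comment.
meta = {
    "type": "instance-metadata",
    "dataset": "jsplib",
    "schema": "croissant.json#jsplib_instances",
    "instanceId": "abz5",
    "features": {
        "jobs": 10,
        "machines": 10,
        "optimum": 1234,
        "bounds": {"lower": 1234, "upper": 1234},
    },
}
record_set = {
    "@type": "cr:RecordSet",
    "name": "jsplib_instances",
    "field": [
        {"@id": "jobs", "dataType": "Integer"},
        {"@id": "machines", "dataType": "Integer"},
        {"@id": "optimum", "dataType": "Integer"},
        {"@id": "bounds.lower", "dataType": "Integer"},
        {"@id": "bounds.upper", "dataType": "Integer"},
    ],
}

problems = check_instance(meta, record_set)
```

A check like this would only confirm that the sidecar and the dataset-level description agree on field names and types; it says nothing about whether the description itself is valid Croissant.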
yes, can be good to make it at download time indeed; and following a standard is also fine if it is not too much extra work or code |
Follow-up of #840, standardising instance-level metadata extraction and storage as .meta.json sidecar files. (Still also includes the changes from #840.)
Instead of extracting metadata on the spot (during iteration over the dataset), this PR proposes doing so ahead of time, on initial download, since extraction can take some time depending on what metadata you want to add. New datasets only have to implement
For each instance, metadata gets stored in a separate sidecar file, <instance_name_without_extensions>.meta.json. For example, for jsplib's instance abz5, abz5.meta.json contains:
{ "name": "abz5", "jobs": 10, "machines": 10, "optimum": 1234, "bounds": { "upper": 1234, "lower": 1234 } }
This data was already being collected before, but from scratch on each access of the dataset. Now it is stored in a reusable metadata file.
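As a rough illustration of the sidecar flow described above, the sketch below writes the metadata once (as would happen at download time) and reads it back later (as would happen during iteration). The function names `sidecar_path`, `write_sidecar`, and `read_sidecar`, the `extract` callback, and the `.txt` instance extension are all assumptions made for this example; they are not the API introduced by this PR.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def sidecar_path(instance_path: Path) -> Path:
    """Map an instance file to its sidecar, e.g. abz5.txt.gz -> abz5.meta.json.

    Assumption: everything after the first dot in the file name is an extension,
    so instance names that themselves contain dots would need different handling.
    """
    name = instance_path.name.split(".", 1)[0]
    return instance_path.with_name(f"{name}.meta.json")


def write_sidecar(instance_path: Path, extract: Callable[[Path], dict]) -> Path:
    """Run the (possibly slow) extraction once and store the result next to the instance."""
    meta_path = sidecar_path(instance_path)
    meta_path.write_text(json.dumps(extract(instance_path), indent=2), encoding="utf-8")
    return meta_path


def read_sidecar(instance_path: Path) -> dict:
    """Load previously stored metadata without touching the instance file itself."""
    return json.loads(sidecar_path(instance_path).read_text(encoding="utf-8"))


def extract_dummy(path: Path) -> dict:
    # Stand-in for a dataset-specific extractor; values copied from the abz5 example above.
    return {
        "name": "abz5",
        "jobs": 10,
        "machines": 10,
        "optimum": 1234,
        "bounds": {"upper": 1234, "lower": 1234},
    }


with tempfile.TemporaryDirectory() as tmp:
    instance = Path(tmp) / "abz5.txt"  # hypothetical file name and extension
    instance.write_text("placeholder instance contents", encoding="utf-8")
    sidecar_name = write_sidecar(instance, extract_dummy).name  # "download time"
    loaded = read_sidecar(instance)  # "iteration time"
```

Keeping the write and read paths separate mirrors the split in the description: extraction cost is paid once per instance, and later accesses only parse a small JSON file.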