From 0ddf691ae106de7d42a76e6f0f9a5cb2e9d4ef12 Mon Sep 17 00:00:00 2001
From: Uneeb808 <150836183+Uneeb808@users.noreply.github.com>
Date: Tue, 31 Mar 2026 00:14:19 +0530
Subject: [PATCH] Create add_dataset_guide.md

---
 docs/src/add_dataset_guide.md | 80 +++++++++++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)
 create mode 100644 docs/src/add_dataset_guide.md

diff --git a/docs/src/add_dataset_guide.md b/docs/src/add_dataset_guide.md
new file mode 100644
index 00000000..fc438f54
--- /dev/null
+++ b/docs/src/add_dataset_guide.md
@@ -0,0 +1,80 @@
+

How to Add a New Dataset to MLDatasets.jl

+

Based on the process used to add AmazonComputers/AmazonPhoto (#249) and ZINC (#250).

+

Step 1: Fork, clone, and set up upstream

+

Fork JuliaML/MLDatasets.jl on GitHub, then:

+
git clone https://github.com/YOUR_USERNAME/MLDatasets.jl
+cd MLDatasets.jl
+git remote add upstream https://github.com/JuliaML/MLDatasets.jl
+git pull upstream master
+git checkout -b add-your-dataset-name
+
+

Always work on a new branch — never commit directly to master.

+

Step 2: Write the dataset file

+

Create src/datasets/graphs/yourfile.jl (or the appropriate category subfolder).

+

The dataset file should contain the following components:

+ +

Include using statements only for packages your file actually uses. Some dataset files need them (e.g. amazon.jl has using DataDeps, using NPZ, and using SparseArrays); others don't. Check existing files in src/datasets/graphs/ to see what fits your case.

+

What NOT to include (these are already defined by the module, and redefining them will cause errors):

+
abstract type AbstractDataset end
+struct Graph ... end
+
+

Also remove all test/debug code before the PR:

+
println(...)
+ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
+@assert ...
+
+
+

💡 Local testing tip: During development it's useful to run your file standalone outside the module. You can temporarily add stub definitions for AbstractDataset, Graph, etc. at the top — just add a clear comment and remove them before opening the PR.

+
+
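A minimal sketch of what such temporary stubs might look like. The field names on Graph and the YourDatasetName type are assumptions made for illustration, not the module's actual definitions; check the real ones in the package source before relying on them.

```julia
# TEMPORARY stubs for standalone testing only. Remove before opening the PR.
# The real definitions live in MLDatasets; the fields below are assumptions.
abstract type AbstractDataset end

Base.@kwdef struct Graph
    num_nodes::Int = 0
    num_edges::Int = 0
    edge_index::Tuple{Vector{Int}, Vector{Int}} = (Int[], Int[])
    node_data = nothing
    edge_data = nothing
end

# Hypothetical dataset type standing in for the one your file defines.
struct YourDatasetName <: AbstractDataset
    metadata::Dict{String, Any}
    graphs::Vector{Graph}
end

# Build a tiny in-memory instance so the code paths can run without downloads.
function toy_dataset()
    g = Graph(num_nodes = 3, num_edges = 2, edge_index = ([1, 2], [2, 3]))
    return YourDatasetName(Dict{String, Any}("name" => "toy"), [g])
end

d = toy_dataset()
```

Keeping the stubs in one clearly commented block at the top of the file makes them easy to delete in a single step before the PR.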

Step 3: Add checksums (don't skip this)

+

After downloading your dataset files, compute SHA256 for each:

+
sha256sum file1.npz
+sha256sum file2.npz
+
+
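If sha256sum is not available on your platform, the same digest can be computed from Julia with the SHA standard library. This is a small sketch; the helper name file_sha256 and the throwaway demo file are illustrative only.

```julia
using SHA  # standard library, no extra install needed

# Return the lowercase hex SHA-256 digest of the file at `path`.
file_sha256(path::AbstractString) = bytes2hex(open(sha256, path))

# Demonstrate on a throwaway file; point this at your downloaded files instead.
demo_path = tempname()
write(demo_path, "abc")
digest = file_sha256(demo_path)
```

The resulting string is what goes into the checksum list of the registration.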

Add them to your DataDep registration:

+
register(DataDep(
+    "DatasetName",
+    "...",
+    ["url1", "url2"],
+    ["sha256-of-url1-file", "sha256-of-url2-file"]  # checksums are the fourth positional argument, not a keyword
+))
+
+
+

⚠️ Important: The order of hashes must exactly match the order of URLs. A mismatch causes DataDeps to prompt interactively during CI, which hangs the test process. This is one of the most common reasons first PRs fail CI.

+
+
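One way to catch an ordering mistake before pushing is to check each downloaded file against the hash listed at the same position. The sketch below assumes the files are already on disk; the helper name mismatched_hashes and the demo files are hypothetical.

```julia
using SHA

# Compare each file against the hash at the same index.
# Returns the indices whose digest does not match.
function mismatched_hashes(paths::Vector{String}, hashes::Vector{String})
    length(paths) == length(hashes) ||
        throw(ArgumentError("need exactly one hash per file"))
    return [i for (i, (p, h)) in enumerate(zip(paths, hashes))
            if bytes2hex(open(sha256, p)) != lowercase(h)]
end

# Demo with two throwaway files whose contents differ.
dir = mktempdir()
paths = [joinpath(dir, "file1.npz"), joinpath(dir, "file2.npz")]
write(paths[1], "first")
write(paths[2], "second")
hashes = [bytes2hex(open(sha256, p)) for p in paths]

in_order = mismatched_hashes(paths, hashes)          # nothing reported
swapped = mismatched_hashes(paths, reverse(hashes))  # both positions reported
```

An empty result means every hash sits at the same position as its file.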

Do not use ENV["DATADEPS_ALWAYS_ACCEPT"] = "true" in the dataset file — CI must rely on checksums instead.

+

Step 4: Register in the module

+

Open src/MLDatasets.jl. Find where other graph datasets are included (search for cora or pubmed) and add your lines alongside them:

+
include("datasets/graphs/yourfile.jl")
+export YourDatasetName
+
+

Then find the __init__() function in the same file and add:

+
__init__yourname()
+
+

Step 5: Add tests

+

In test/datasets/graphs.jl, add a @testset block after the existing ones. Test that the dataset loads, node/edge counts match known values, and features have the correct shape. The exact assertions depend on your dataset.

+
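The shape of such a test might look like the sketch below. To stay runnable without a download, it uses a hand-built NamedTuple in place of the loaded graph; the field names and the features-by-nodes layout are assumptions, and the numbers are placeholders for your dataset's known values.

```julia
using Test

# Stand-in for the loaded graph. In the real test this comes from your dataset.
g = (num_nodes = 4, num_edges = 3,
     edge_index = ([1, 2, 3], [2, 3, 4]),
     node_data = (features = rand(Float32, 5, 4),))

@testset "YourDatasetName" begin
    # Counts match the published values for the dataset.
    @test g.num_nodes == 4
    @test g.num_edges == 3

    # Edge list is consistent with the counts and stays within node bounds.
    s, t = g.edge_index
    @test length(s) == length(t) == g.num_edges
    @test all(1 .<= s .<= g.num_nodes)
    @test all(1 .<= t .<= g.num_nodes)

    # Features have one column per node.
    @test size(g.node_data.features) == (5, g.num_nodes)
end
```

Hard-coding the expected counts is deliberate: a silent change in the upstream files then shows up as a failing test.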

Step 6: Add documentation

+

In docs/src/datasets/graphs.md, add your dataset name inside the @docs block alongside the others. The documentation page auto-generates from your docstring — no extra writing needed.

+

Step 7: Commit and push

+
git add .
+git commit -m "Add YourDatasetName graph dataset"
+git push origin add-your-dataset-name
+
+

Step 8: Open the Pull Request

+

Go to your fork on GitHub and click Compare & Pull Request. Set the base to JuliaML/MLDatasets.jl:master.

+

In the PR description, link the relevant issue so it closes automatically on merge:

+
Closes #<issue_number>
+
+

Step 9: CI

+

GitHub Actions automatically runs tests on macOS, Ubuntu, Windows, and multiple Julia versions. All checks must pass before a maintainer will review. If something fails, expand the failing check log, fix it locally, and push again — the PR updates automatically and CI reruns.

+ + +
+

This guide is based on the contributor experience from #249 (AmazonComputers/AmazonPhoto) and #250 (ZINC).

+

This closes #234.