Updated `data_request` to support splits by drewoldag · Pull Request #690 · lincc-frameworks/hyrax

drewoldag · 2026-02-11T18:58:44Z

The initial work in this PR adds pydantic support that will:

Fully resolve the data_location path, and ensure that it is persisted in the runtime_config.toml file
Validate optional split_fraction values between 0 and 1
Ensure that the split_fraction key is only present on DataRequestConfigs that include primary_id_field.
Check that split_fraction values sum to <1 for all DataRequestConfigs that have the same data_location value.
Remove the hard coded names for train, validate, and infer, so that we can include other names without needing to update the data_request.py schema.
Require that for splits that use the same data_location - all or none of them must define a split_fraction.

Additionally, all of these changes should be covered by new or updated unit tests.

This PR also includes the work to utilize the split_fractions defined in the data_request to be used for defining the splits in place of the [data_set][train_size|validate_size|test_size].

If the legacy splits are discovered, we will log deprecation warnings and provide instructions for migrating from the old [data_set][train_size] split definition to the split_fractions defined in a data_request.

…ion.

codecov · 2026-02-11T19:04:30Z

Codecov Report

❌ Patch coverage is 94.17476% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.87%. Comparing base (4c93171) to head (5c62edf).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/hyrax/verbs/engine.py	0.00%	9 Missing ⚠️
src/hyrax/config_schemas/data_request.py	98.96%	1 Missing ⚠️
src/hyrax/pytorch_ignite.py	98.38%	1 Missing ⚠️
src/hyrax/verbs/train.py	95.65%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #690      +/-   ##
==========================================
+ Coverage   64.17%   64.87%   +0.70%     
==========================================
  Files          61       61              
  Lines        5892     6024     +132     
==========================================
+ Hits         3781     3908     +127     
- Misses       2111     2116       +5

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-actions · 2026-02-11T19:40:17Z

Before [`4c93171`]	After [`122ea0a`]	Ratio	Benchmark (Parameter)
178±1ms	253±2ms	1.43	benchmarks.time_import
104±1μs	111±5μs	1.07	data_request_benchmarks.DatasetRequestBenchmarks.time_request_all_data
1.72±0.01s	1.74±0.02s	1.01	benchmarks.time_database_connection_help
54.9±0.2s	55.3±0.03s	1.01	vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(16384, 'qdrant')
1.75±0.03s	1.74±0.04s	1	benchmarks.time_help
1.75±0.04s	1.74±0.02s	1	benchmarks.time_infer_help
1.07G	1.06G	1	vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(256, 'chromadb')
1.04G	1.04G	1	vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(64, 'chromadb')
1.05G	1.05G	1	vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(64, 'qdrant')
29.1±0.02s	29.0±0.02s	1	vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(16384, 'chromadb')

Click here to view all benchmarks.

…tion, all the splits for the same data_location define a split_fraction.

Copilot

Pull request overview

Adds first-class split_fraction support to data_request via Pydantic validation and wires those fractions through dataset setup and dataloader creation so split definitions can live in data_request instead of [data_set].(train_size/validate_size/test_size).

Changes:

Extend data_request schemas with split_fraction validation, absolute data_location normalization, and cross-group consistency checks.
Update dataset setup / dataloading to compute and respect per-group split_indices derived from split_fraction.
Update verbs to request only relevant dataset groups (via setup_dataset(..., splits=..., shuffle=...)) and add integration/unit tests for the new behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tests/hyrax/test_split_fraction_integration.py	New integration tests covering `DataProvider → create_splits_from_fractions → dist_data_loader`.
tests/hyrax/test_data_request_config.py	Expanded unit tests for new schema behaviors (path resolution, fraction validation, arbitrary group names).
src/hyrax/verbs/train.py	Uses split-aware dataset setup and introduces split-handling logic + deprecation warning for legacy sizes.
src/hyrax/verbs/to_onnx.py	Loads only the `infer` split deterministically for ONNX export.
src/hyrax/verbs/test.py	Loads only the `test` split deterministically.
src/hyrax/verbs/infer.py	Loads only the `infer` split deterministically and documents split-index behavior.
src/hyrax/verbs/engine.py	Loads only the `infer` split deterministically.
src/hyrax/pytorch_ignite.py	Adds `splits`/`shuffle` to `setup_dataset`, computes `split_indices`, updates `dist_data_loader`, and adds `create_splits_from_fractions`.
src/hyrax/data_sets/data_provider.py	Stores `split_fraction`, `primary_data_location`, and externally assigned `split_indices`.
src/hyrax/config_schemas/data_request.py	Refactors `DataRequestDefinition` into a `RootModel` supporting arbitrary group names + split-fraction validation rules.

src/hyrax/pytorch_ignite.py

src/hyrax/config_schemas/data_request.py

src/hyrax/verbs/train.py

src/hyrax/pytorch_ignite.py

…ugs/edge cases.

review-notebook-app · 2026-02-13T22:34:47Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

drewoldag · 2026-02-13T22:46:05Z

TODO.md

This will be removed before merging to main.

drewoldag · 2026-02-13T22:54:47Z

src/hyrax/data_sets/inference_dataset.py

        #       we can bring up Only the metadata for a dataset, without constructing the whole thing.
        self._original_dataset_config["data_set"]["preload_cache"] = False
-        self.original_dataset = setup_dataset(self._original_dataset_config)  # type: ignore[arg-type]
+        self.original_dataset = setup_dataset(self._original_dataset_config, splits=("infer",), shuffle=False)  # type: ignore[arg-type]


@mtauraso specifically - if you wouldn't mind helping me think through places where this might cause regressions, I would appreciate it. I've run train->infer->umap and things seem to behave. But I'm not confident that I've covered everything.

…ts from different directories.

Updated the pydantic validators for data_request to allow split_fract…

41ec84a

…ion.

drewoldag self-assigned this Feb 11, 2026

This was linked to issues Feb 11, 2026

Resolve data_location directory #588

Open

Define data split percentages within data_request #663

Open

drewoldag added 2 commits February 11, 2026 13:58

Adding pydantic test to ensure that if one split defines a split_frac…

24624da

…tion, all the splits for the same data_location define a split_fraction.

WIP - Making use of the splits defined in data_request.

10200f0

drewoldag requested a review from Copilot February 12, 2026 05:18

Copilot started reviewing on behalf of drewoldag February 12, 2026 05:19 View session

Copilot AI reviewed Feb 12, 2026

View reviewed changes

drewoldag and others added 3 commits February 12, 2026 14:29

Adding temporary TODO.md file so that I can work through discovered b…

854b404

…ugs/edge cases.

Worked through TODO.md items.

64d2e9e

Updating existing documentation and fixed new bug in InferenceDataSet.

a3f8efa

Merge branch 'main' into issue/663/define-split-fraction-in-data-request

0591e72

drewoldag marked this pull request as ready for review February 13, 2026 22:40

drewoldag requested review from SamSandwich07, gitosaurus and mtauraso February 13, 2026 22:40

drewoldag commented Feb 13, 2026

View reviewed changes

TODO.md

Copy link

Collaborator Author

drewoldag Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be removed before merging to main.

drewoldag commented Feb 13, 2026

View reviewed changes

drewoldag and others added 2 commits February 13, 2026 16:48

Addressed a bug in train that would have caused a regression for spli…

a79ad99

…ts from different directories.

Merge branch 'main' into issue/663/define-split-fraction-in-data-request

5c62edf

This was referenced Feb 19, 2026

'data' key not required; can be any friendly name. #710

Merged

visualize verb should find all its inputs from config. #713

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updated `data_request` to support splits#690

Updated `data_request` to support splits#690
drewoldag wants to merge 9 commits intomainfrom
issue/663/define-split-fraction-in-data-request

drewoldag commented Feb 11, 2026 •

edited

Loading

Uh oh!

codecov bot commented Feb 11, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Feb 11, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

review-notebook-app bot commented Feb 13, 2026

Uh oh!

drewoldag Feb 13, 2026

Uh oh!

drewoldag Feb 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

drewoldag commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions bot commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

review-notebook-app bot commented Feb 13, 2026

Uh oh!

drewoldag Feb 13, 2026

Choose a reason for hiding this comment

Uh oh!

drewoldag Feb 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

drewoldag commented Feb 11, 2026 •

edited

Loading

codecov bot commented Feb 11, 2026 •

edited

Loading

github-actions bot commented Feb 11, 2026 •

edited

Loading