Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #690 +/- ##
==========================================
+ Coverage 64.17% 64.87% +0.70%
==========================================
Files 61 61
Lines 5892 6024 +132
==========================================
+ Hits 3781 3908 +127
- Misses 2111 2116 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Click here to view all benchmarks. |
…tion, all the splits for the same data_location define a split_fraction.
There was a problem hiding this comment.
Pull request overview
Adds first-class split_fraction support to data_request via Pydantic validation and wires those fractions through dataset setup and dataloader creation so split definitions can live in data_request instead of [data_set].(train_size/validate_size/test_size).
Changes:
- Extend
data_requestschemas withsplit_fractionvalidation, absolutedata_locationnormalization, and cross-group consistency checks. - Update dataset setup / dataloading to compute and respect per-group
split_indicesderived fromsplit_fraction. - Update verbs to request only relevant dataset groups (via
setup_dataset(..., splits=..., shuffle=...)) and add integration/unit tests for the new behavior.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/hyrax/test_split_fraction_integration.py | New integration tests covering DataProvider → create_splits_from_fractions → dist_data_loader. |
| tests/hyrax/test_data_request_config.py | Expanded unit tests for new schema behaviors (path resolution, fraction validation, arbitrary group names). |
| src/hyrax/verbs/train.py | Uses split-aware dataset setup and introduces split-handling logic + deprecation warning for legacy sizes. |
| src/hyrax/verbs/to_onnx.py | Loads only the infer split deterministically for ONNX export. |
| src/hyrax/verbs/test.py | Loads only the test split deterministically. |
| src/hyrax/verbs/infer.py | Loads only the infer split deterministically and documents split-index behavior. |
| src/hyrax/verbs/engine.py | Loads only the infer split deterministically. |
| src/hyrax/pytorch_ignite.py | Adds splits/shuffle to setup_dataset, computes split_indices, updates dist_data_loader, and adds create_splits_from_fractions. |
| src/hyrax/data_sets/data_provider.py | Stores split_fraction, primary_data_location, and externally assigned split_indices. |
| src/hyrax/config_schemas/data_request.py | Refactors DataRequestDefinition into a RootModel supporting arbitrary group names + split-fraction validation rules. |
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
There was a problem hiding this comment.
This will be removed before merging to main.
| # we can bring up Only the metadata for a dataset, without constructing the whole thing. | ||
| self._original_dataset_config["data_set"]["preload_cache"] = False | ||
| self.original_dataset = setup_dataset(self._original_dataset_config) # type: ignore[arg-type] | ||
| self.original_dataset = setup_dataset(self._original_dataset_config, splits=("infer",), shuffle=False) # type: ignore[arg-type] |
There was a problem hiding this comment.
@mtauraso specifically - if you wouldn't mind helping me think through places where this might cause regressions, I would appreciate it. I've run train->infer->umap and things seem to behave. But I'm not confident that I've covered everything.
The initial work in this PR adds pydantic support that will:
data_locationpath, and ensure that it is persisted in the runtime_config.toml filesplit_fractionvalues between 0 and 1split_fractionkey is only present onDataRequestConfigs that includeprimary_id_field.split_fractionvalues sum to <1 for allDataRequestConfigs that have the samedata_locationvalue.train,validate, andinfer, so that we can include other names without needing to update the data_request.py schema.data_location- all or none of them must define asplit_fraction.Additionally, all of these changes should be covered by new or updated unit tests.
This PR also includes the work to utilize the split_fractions defined in the
data_requestto be used for defining the splits in place of the[data_set][train_size|validate_size|test_size].If the legacy splits are discovered, we will log deprecation warnings and provide instructions for migrating from the old
[data_set][train_size]split definition to the split_fractions defined in a data_request.