Skip to content

Updated data_request to support splits#690

Open
drewoldag wants to merge 9 commits intomainfrom
issue/663/define-split-fraction-in-data-request
Open

Updated data_request to support splits#690
drewoldag wants to merge 9 commits intomainfrom
issue/663/define-split-fraction-in-data-request

Conversation

@drewoldag
Copy link
Collaborator

@drewoldag drewoldag commented Feb 11, 2026

The initial work in this PR adds pydantic support that will:

  • Fully resolve the data_location path, and ensure that it is persisted in the runtime_config.toml file
  • Validate optional split_fraction values between 0 and 1
  • Ensure that the split_fraction key is only present on DataRequestConfigs that include primary_id_field.
  • Check that split_fraction values sum to <1 for all DataRequestConfigs that have the same data_location value.
  • Remove the hard coded names for train, validate, and infer, so that we can include other names without needing to update the data_request.py schema.
  • Require that for splits that use the same data_location - all or none of them must define a split_fraction.

Additionally, all of these changes should be covered by new or updated unit tests.

This PR also includes the work to utilize the split_fractions defined in the data_request to be used for defining the splits in place of the [data_set][train_size|validate_size|test_size].

If the legacy splits are discovered, we will log deprecation warnings and provide instructions for migrating from the old [data_set][train_size] split definition to the split_fractions defined in a data_request.

@codecov
Copy link

codecov bot commented Feb 11, 2026

Codecov Report

❌ Patch coverage is 94.17476% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.87%. Comparing base (4c93171) to head (5c62edf).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/hyrax/verbs/engine.py 0.00% 9 Missing ⚠️
src/hyrax/config_schemas/data_request.py 98.96% 1 Missing ⚠️
src/hyrax/pytorch_ignite.py 98.38% 1 Missing ⚠️
src/hyrax/verbs/train.py 95.65% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #690      +/-   ##
==========================================
+ Coverage   64.17%   64.87%   +0.70%     
==========================================
  Files          61       61              
  Lines        5892     6024     +132     
==========================================
+ Hits         3781     3908     +127     
- Misses       2111     2116       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions
Copy link

github-actions bot commented Feb 11, 2026

Before [4c93171] After [122ea0a] Ratio Benchmark (Parameter)
178±1ms 253±2ms 1.43 benchmarks.time_import
104±1μs 111±5μs 1.07 data_request_benchmarks.DatasetRequestBenchmarks.time_request_all_data
1.72±0.01s 1.74±0.02s 1.01 benchmarks.time_database_connection_help
54.9±0.2s 55.3±0.03s 1.01 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(16384, 'qdrant')
1.75±0.03s 1.74±0.04s 1 benchmarks.time_help
1.75±0.04s 1.74±0.02s 1 benchmarks.time_infer_help
1.07G 1.06G 1 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(256, 'chromadb')
1.04G 1.04G 1 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(64, 'chromadb')
1.05G 1.05G 1 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(64, 'qdrant')
29.1±0.02s 29.0±0.02s 1 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(16384, 'chromadb')

Click here to view all benchmarks.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds first-class split_fraction support to data_request via Pydantic validation and wires those fractions through dataset setup and dataloader creation so split definitions can live in data_request instead of [data_set].(train_size/validate_size/test_size).

Changes:

  • Extend data_request schemas with split_fraction validation, absolute data_location normalization, and cross-group consistency checks.
  • Update dataset setup / dataloading to compute and respect per-group split_indices derived from split_fraction.
  • Update verbs to request only relevant dataset groups (via setup_dataset(..., splits=..., shuffle=...)) and add integration/unit tests for the new behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/hyrax/test_split_fraction_integration.py New integration tests covering DataProvider → create_splits_from_fractions → dist_data_loader.
tests/hyrax/test_data_request_config.py Expanded unit tests for new schema behaviors (path resolution, fraction validation, arbitrary group names).
src/hyrax/verbs/train.py Uses split-aware dataset setup and introduces split-handling logic + deprecation warning for legacy sizes.
src/hyrax/verbs/to_onnx.py Loads only the infer split deterministically for ONNX export.
src/hyrax/verbs/test.py Loads only the test split deterministically.
src/hyrax/verbs/infer.py Loads only the infer split deterministically and documents split-index behavior.
src/hyrax/verbs/engine.py Loads only the infer split deterministically.
src/hyrax/pytorch_ignite.py Adds splits/shuffle to setup_dataset, computes split_indices, updates dist_data_loader, and adds create_splits_from_fractions.
src/hyrax/data_sets/data_provider.py Stores split_fraction, primary_data_location, and externally assigned split_indices.
src/hyrax/config_schemas/data_request.py Refactors DataRequestDefinition into a RootModel supporting arbitrary group names + split-fraction validation rules.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@drewoldag drewoldag marked this pull request as ready for review February 13, 2026 22:40
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be removed before merging to main.

# we can bring up Only the metadata for a dataset, without constructing the whole thing.
self._original_dataset_config["data_set"]["preload_cache"] = False
self.original_dataset = setup_dataset(self._original_dataset_config) # type: ignore[arg-type]
self.original_dataset = setup_dataset(self._original_dataset_config, splits=("infer",), shuffle=False) # type: ignore[arg-type]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtauraso specifically - if you wouldn't mind helping me think through places where this might cause regressions, I would appreciate it. I've run train->infer->umap and things seem to behave. But I'm not confident that I've covered everything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Define data split percentages within data_request Resolve data_location directory

1 participant

Comments