Skip to content

Comments

Removing the requirement for Datasets to implement ids()#704

Open
mtauraso wants to merge 8 commits intomainfrom
mtauraso/no-ids
Open

Removing the requirement for Datasets to implement ids()#704
mtauraso wants to merge 8 commits intomainfrom
mtauraso/no-ids

Conversation

@mtauraso
Copy link
Collaborator

  • HSCDataSet and FitsFileDatasets ids() -> _all_ids() iterators
  • InferenceDataset and DataProvider continue to have ids() methods, but they are entirely based on the get_object_id() HyraxQL getter
  • Some Iterable Datasets have been removed from tests to simplify a case where ids() was called on all Iterable datasets. More full removal of Iterable Datasets will occur in a later commit
  • ids() methods on InferenceDataset and DataProvider now return lists to make the callsites more ergonomic
  • Several lookups on list(obj.ids()) have been refactored to call get_object_id for efficiency and clarity.

- HSCDataSet and FitsFileDatasets ids() -> _all_ids() iterators
- InferenceDataset and DataProvider continue to have ids() methods, but
  they are entirely based on the get_object_id() HyraxQL getter
- Some Iterable Datasets have been removed from tests to simplify
  a case where ids() was called on all Iterable datasets. More full removal
  of Iterable Datasets will occur in a later commit
- ids() methods on InferenceDataset and DataProvider now return lists to make
  the callsites more ergonomic
- Several lookups on list(obj.ids()) have been refactored to call get_object_id
  for efficiency and clarity.
Copilot AI review requested due to automatic review settings February 14, 2026 00:40
@mtauraso
Copy link
Collaborator Author

This still needs to percolate through example notebooks and docs, which I'm going to try to do on this PR.

Things that are showing up for future in this direction:

  1. Complete removal of IterableDatasets and related functions
  2. Removing the object_id_column_name functionality from the Hyrax Dataset Base class (and the detection code in a bunch of our dataset classes). YAGNI since the model_inputs has to say the primary id field anyway
  3. Removing all the __getitem__ on Dataset classes. Datasets just need to implement getters
  4. Making access to InferenceDataset go via DataProvider. This enables InferenceDataset to actually look like a dataset rather than its current mode where it acts both as a DataProvider (for some verbs) and also a Dataset in other contexts.

@codecov
Copy link

codecov bot commented Feb 14, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.59%. Comparing base (0aebbf0) to head (0b00b8a).

Files with missing lines Patch % Lines
src/hyrax/verbs/visualize.py 0.00% 3 Missing ⚠️
src/hyrax/data_sets/hsc_data_set.py 0.00% 1 Missing ⚠️
src/hyrax/verbs/lookup.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #704      +/-   ##
==========================================
- Coverage   63.25%   62.59%   -0.67%     
==========================================
  Files          59       59              
  Lines        5770     5737      -33     
==========================================
- Hits         3650     3591      -59     
- Misses       2120     2146      +26     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors Hyrax dataset ID handling to remove the expectation that datasets expose an ids() generator, shifting most callsites toward get_object_id(idx) and keeping ids() as an eager list primarily on DataProvider and InferenceDataSet.

Changes:

  • Converts multiple callsites (verbs + tests) from list(x.ids()) to x.ids() and/or get_object_id(idx).
  • Removes iterable dataset implementations/tests (CIFAR + Random) as an interim simplification.
  • Introduces/standardizes get_object_id() on InferenceDataSet and DataProvider, and makes their ids() methods return lists.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/hyrax/test_umap.py Updates UMAP ID comparisons to use list-returning ids() without extra list(...).
tests/hyrax/test_save_to_database.py Updates ID extraction for DB save test to match list-returning ids().
tests/hyrax/test_inference_dataset.py Updates ID enumeration and replaces list(result.ids())[i] with get_object_id(i).
tests/hyrax/test_infer.py Updates inference ordering test to rely on list-returning ids().
tests/hyrax/test_fits_image_dataset.py Adjusts expectation to compare directly against DataProvider.ids() list.
tests/hyrax/test_e2e.py Removes iterable CIFAR dataset from parametrized E2E matrix.
tests/hyrax/test_data_provider.py Renames/adjusts primary dataset test and updates ID expectations.
tests/hyrax/conftest.py Removes iterable random dataset from loopback fixture parameterization.
src/hyrax/verbs/visualize.py Replaces repeated list(ids())[idx] patterns with get_object_id(idx).
src/hyrax/verbs/umap.py Uses list-returning InferenceDataSet.ids() directly when building all_ids.
src/hyrax/verbs/lookup.py Uses list-returning InferenceDataSet.ids() directly for lookup indexing.
src/hyrax/pytorch_ignite.py Simplifies index generation for the no-split dataloader path.
src/hyrax/data_sets/random/hyrax_random_dataset.py Removes iterable random dataset and its ids() generator implementation.
src/hyrax/data_sets/inference_dataset.py Adds get_object_id() and changes ids() to return a list; refactors metadata ID lookup.
src/hyrax/data_sets/hyrax_cifar_dataset.py Removes iterable CIFAR dataset and its ids() implementation.
src/hyrax/data_sets/hsc_data_set.py Switches internal enumeration from ids() to _all_ids().
src/hyrax/data_sets/fits_image_dataset.py Makes FITS dataset ID iterator private (_all_ids) and updates internal users.
src/hyrax/data_sets/downloaded_lsst_dataset.py Removes ids() generator implementation.
src/hyrax/data_sets/data_set_registry.py Removes default ids() implementation and metadata auto-injection based on ids().
src/hyrax/data_sets/data_provider.py Adds get_object_id(), makes ids() eager list, and enforces primary dataset presence.
src/hyrax/data_sets/init.py Stops exporting iterable datasets from the public hyrax.data_sets surface.

Comment on lines +35 to 36
@pytest.fixture(scope="function", params=["HyraxRandomDataset"])
def loopback_hyrax(tmp_path_factory, request):
Copy link

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fixture no longer parameterizes HyraxRandomIterableDataset, but the if request.param == "HyraxRandomIterableDataset": ... block below is now dead code. Removing that branch will avoid confusion and keep the fixture aligned with the supported datasets in this PR.

Copilot uses AI. Check for mistakes.
@github-actions
Copy link

github-actions bot commented Feb 14, 2026

Before [9fb8b22] After [457076a] Ratio Benchmark (Parameter)
7.46±0.03s failed n/a data_cache_benchmarks.DataCacheBenchmarks.time_preload_cache_hsc1k
16.01618240563191 failed n/a data_cache_benchmarks.DataCacheBenchmarks.track_cache_hsc1k_hyrax_size_undercount
115±1μs failed n/a data_request_benchmarks.DatasetRequestBenchmarks.time_request_all_data
1.71G 1.83G 1.06 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'chromadb')
1.94±0.01s 1.95±0.01s 1.01 benchmarks.time_infer_help
1.93±0.02s 1.96±0.01s 1.01 benchmarks.time_lookup_help
1.94±0.01s 1.96±0.01s 1.01 benchmarks.time_rebuild_manifest_help
1.94±0.03s 1.96±0.01s 1.01 benchmarks.time_train_help
1.94±0.01s 1.95±0.01s 1.01 benchmarks.time_umap_help
3.97G 4.01G 1.01 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'qdrant')

Click here to view all benchmarks.

mtauraso and others added 6 commits February 20, 2026 14:55
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant