Removing the requirement for Datasets to implement ids() by mtauraso · Pull Request #704 · lincc-frameworks/hyrax

mtauraso · 2026-02-14T00:40:28Z

HSCDataSet and FitsFileDatasets ids() -> _all_ids() iterators
InferenceDataset and DataProvider continue to have ids() methods, but they are entirely based on the get_object_id() HyraxQL getter
Some Iterable Datasets have been removed from tests to simplify a case where ids() was called on all Iterable datasets. More full removal of Iterable Datasets will occur in a later commit
ids() methods on InferenceDataset and DataProvider now return lists to make the callsites more ergonomic
Several lookups on list(obj.ids()) have been refactored to call get_object_id for efficiency and clarity.

- HSCDataSet and FitsFileDatasets ids() -> _all_ids() iterators - InferenceDataset and DataProvider continue to have ids() methods, but they are entirely based on the get_object_id() HyraxQL getter - Some Iterable Datasets have been removed from tests to simplify a case where ids() was called on all Iterable datasets. More full removal of Iterable Datasets will occur in a later commit - ids() methods on InferenceDataset and DataProvider now return lists to make the callsites more ergonomic - Several lookups on list(obj.ids()) have been refactored to call get_object_id for efficiency and clarity.

mtauraso · 2026-02-14T00:44:49Z

This still needs to percolate through example notebooks and docs, which I'm going to try to do on this PR.

Things that are showing up for future in this direction:

Complete removal of IterableDatasets and related functions
Removing the object_id_column_name functionality from the Hyrax Dataset Base class (and the detection code in a bunch of our dataset classes). YAGNI since the model_inputs has to say the primary id field anyway
Removing all the __getitem__ on Dataset classes. Datasets just need to implement getters
Making access to InferenceDataset go via DataProvider. This enables InferenceDataset to actually look like a dataset rather than its current mode where it acts both as a DataProvider (for some verbs) and also a Dataset in other contexts.

codecov · 2026-02-14T00:46:22Z

Codecov Report

❌ Patch coverage is 83.33333% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.59%. Comparing base (0aebbf0) to head (0b00b8a).

Files with missing lines	Patch %	Lines
src/hyrax/verbs/visualize.py	0.00%	3 Missing ⚠️
src/hyrax/data_sets/hsc_data_set.py	0.00%	1 Missing ⚠️
src/hyrax/verbs/lookup.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #704      +/-   ##
==========================================
- Coverage   63.25%   62.59%   -0.67%     
==========================================
  Files          59       59              
  Lines        5770     5737      -33     
==========================================
- Hits         3650     3591      -59     
- Misses       2120     2146      +26

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot

Pull request overview

This PR refactors Hyrax dataset ID handling to remove the expectation that datasets expose an ids() generator, shifting most callsites toward get_object_id(idx) and keeping ids() as an eager list primarily on DataProvider and InferenceDataSet.

Changes:

Converts multiple callsites (verbs + tests) from list(x.ids()) to x.ids() and/or get_object_id(idx).
Removes iterable dataset implementations/tests (CIFAR + Random) as an interim simplification.
Introduces/standardizes get_object_id() on InferenceDataSet and DataProvider, and makes their ids() methods return lists.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
tests/hyrax/test_umap.py	Updates UMAP ID comparisons to use list-returning `ids()` without extra `list(...)`.
tests/hyrax/test_save_to_database.py	Updates ID extraction for DB save test to match list-returning `ids()`.
tests/hyrax/test_inference_dataset.py	Updates ID enumeration and replaces `list(result.ids())[i]` with `get_object_id(i)`.
tests/hyrax/test_infer.py	Updates inference ordering test to rely on list-returning `ids()`.
tests/hyrax/test_fits_image_dataset.py	Adjusts expectation to compare directly against `DataProvider.ids()` list.
tests/hyrax/test_e2e.py	Removes iterable CIFAR dataset from parametrized E2E matrix.
tests/hyrax/test_data_provider.py	Renames/adjusts primary dataset test and updates ID expectations.
tests/hyrax/conftest.py	Removes iterable random dataset from loopback fixture parameterization.
src/hyrax/verbs/visualize.py	Replaces repeated `list(ids())[idx]` patterns with `get_object_id(idx)`.
src/hyrax/verbs/umap.py	Uses list-returning `InferenceDataSet.ids()` directly when building `all_ids`.
src/hyrax/verbs/lookup.py	Uses list-returning `InferenceDataSet.ids()` directly for lookup indexing.
src/hyrax/pytorch_ignite.py	Simplifies index generation for the no-split dataloader path.
src/hyrax/data_sets/random/hyrax_random_dataset.py	Removes iterable random dataset and its `ids()` generator implementation.
src/hyrax/data_sets/inference_dataset.py	Adds `get_object_id()` and changes `ids()` to return a list; refactors metadata ID lookup.
src/hyrax/data_sets/hyrax_cifar_dataset.py	Removes iterable CIFAR dataset and its `ids()` implementation.
src/hyrax/data_sets/hsc_data_set.py	Switches internal enumeration from `ids()` to `_all_ids()`.
src/hyrax/data_sets/fits_image_dataset.py	Makes FITS dataset ID iterator private (`_all_ids`) and updates internal users.
src/hyrax/data_sets/downloaded_lsst_dataset.py	Removes `ids()` generator implementation.
src/hyrax/data_sets/data_set_registry.py	Removes default `ids()` implementation and metadata auto-injection based on `ids()`.
src/hyrax/data_sets/data_provider.py	Adds `get_object_id()`, makes `ids()` eager list, and enforces primary dataset presence.
src/hyrax/data_sets/init.py	Stops exporting iterable datasets from the public `hyrax.data_sets` surface.

src/hyrax/data_sets/__init__.py

src/hyrax/data_sets/random/hyrax_random_dataset.py

src/hyrax/pytorch_ignite.py

tests/hyrax/test_data_provider.py

Copilot · 2026-02-14T00:48:12Z

tests/hyrax/conftest.py

+@pytest.fixture(scope="function", params=["HyraxRandomDataset"])
 def loopback_hyrax(tmp_path_factory, request):


The fixture no longer parameterizes HyraxRandomIterableDataset, but the if request.param == "HyraxRandomIterableDataset": ... block below is now dead code. Removing that branch will avoid confusion and keep the fixture aligned with the supported datasets in this PR.

tests/hyrax/test_data_provider.py

src/hyrax/data_sets/data_provider.py

src/hyrax/data_sets/inference_dataset.py

tests/hyrax/test_inference_dataset.py

github-actions · 2026-02-14T01:22:50Z

Before [`9fb8b22`]	After [`457076a`]	Ratio	Benchmark (Parameter)
7.46±0.03s	failed	n/a	data_cache_benchmarks.DataCacheBenchmarks.time_preload_cache_hsc1k
16.01618240563191	failed	n/a	data_cache_benchmarks.DataCacheBenchmarks.track_cache_hsc1k_hyrax_size_undercount
115±1μs	failed	n/a	data_request_benchmarks.DatasetRequestBenchmarks.time_request_all_data
1.71G	1.83G	1.06	vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'chromadb')
1.94±0.01s	1.95±0.01s	1.01	benchmarks.time_infer_help
1.93±0.02s	1.96±0.01s	1.01	benchmarks.time_lookup_help
1.94±0.01s	1.96±0.01s	1.01	benchmarks.time_rebuild_manifest_help
1.94±0.03s	1.96±0.01s	1.01	benchmarks.time_train_help
1.94±0.01s	1.95±0.01s	1.01	benchmarks.time_umap_help
3.97G	4.01G	1.01	vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'qdrant')

Click here to view all benchmarks.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

review-notebook-app · 2026-02-20T23:13:15Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

Copilot AI review requested due to automatic review settings February 14, 2026 00:40

Merge branch 'main' into mtauraso/no-ids

0b00b8a

Copilot started reviewing on behalf of mtauraso February 14, 2026 00:40 View session

Copilot AI reviewed Feb 14, 2026

View reviewed changes

mtauraso and others added 6 commits February 20, 2026 14:55

Update tests/hyrax/test_inference_dataset.py

bf006f7

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Apply suggestion from @Copilot

976f59c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Apply suggestion from @Copilot

96070c8

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Apply suggestion from @Copilot

2b23277

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Merge branch 'main' into mtauraso/no-ids

65a30f0

Fixing up notebooks

4402606

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Removing the requirement for Datasets to implement ids()#704

Removing the requirement for Datasets to implement ids()#704
mtauraso wants to merge 8 commits intomainfrom
mtauraso/no-ids

mtauraso commented Feb 14, 2026

Uh oh!

mtauraso commented Feb 14, 2026

Uh oh!

codecov bot commented Feb 14, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Feb 14, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Feb 14, 2026 •

edited

Loading

Uh oh!

review-notebook-app bot commented Feb 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		@pytest.fixture(scope="function", params=["HyraxRandomDataset"])
		def loopback_hyrax(tmp_path_factory, request):

Comments

Conversation

mtauraso commented Feb 14, 2026

Uh oh!

mtauraso commented Feb 14, 2026

Uh oh!

codecov bot commented Feb 14, 2026

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Feb 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

review-notebook-app bot commented Feb 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions bot commented Feb 14, 2026 •

edited

Loading