
Moving precache threads to config#673

Open
mtauraso wants to merge 6 commits into main from mtauraso/precache-config

Conversation

@mtauraso
Collaborator

@mtauraso mtauraso commented Feb 4, 2026

No description provided.

@mtauraso mtauraso requested a review from drewoldag February 4, 2026 18:25
@mtauraso mtauraso self-assigned this Feb 4, 2026
Copilot AI review requested due to automatic review settings February 4, 2026 18:25
@codecov

codecov bot commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.94%. Comparing base (f90813e) to head (a58b902).

Files with missing lines | Patch % | Lines
src/hyrax/data_sets/data_cache.py | 0.00% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #673      +/-   ##
==========================================
- Coverage   62.95%   62.94%   -0.02%     
==========================================
  Files          58       58              
  Lines        5782     5780       -2     
==========================================
- Hits         3640     3638       -2     
  Misses       2142     2142              

☔ View full report in Codecov by Sentry.

Contributor

Copilot AI left a comment


Pull request overview

This PR moves the DataCache preload thread count from a hard-coded value to a runtime configuration option.

Changes:

  • Add data_set.preload_threads (default 50) to hyrax_default_config.toml.
  • Update DataCache to use config["data_set"]["preload_threads"] for ThreadPoolExecutor(max_workers=...) and for the lazy executor’s in-flight futures limit.
  • Remove the previously hard-coded _determine_numprocs_preload() helper.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File | Description
src/hyrax/hyrax_default_config.toml | Introduces the preload_threads configuration knob for cache preloading.
src/hyrax/data_sets/data_cache.py | Switches preload concurrency to be driven by data_set.preload_threads instead of a hard-coded constant.
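As a sketch, the new knob in hyrax_default_config.toml presumably looks something like the following — the key name and default come from the review summary above, but the comment wording is illustrative, not copied from the diff:

```toml
[data_set]
# Number of threads used both as ThreadPoolExecutor workers and as the
# cap on in-flight futures when preloading the data cache. The default
# of 50 is tuned for UW's HYAK Klone filesystem.
# (Comment text here is illustrative, not the actual config comment.)
preload_threads = 50
```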

# Save config we need
self._use_cache = config["data_set"]["use_cache"]
self._preload_cache = config["data_set"]["preload_cache"]
self._preload_threads = config["data_set"]["preload_threads"]

Copilot AI Feb 4, 2026


preload_threads is read unconditionally from config["data_set"], which will raise a KeyError for configs created from older saved runs (e.g., InferenceDataSet loads original_dataset_config.toml via ConfigManager.read_runtime_config(...) without merging defaults) or any caller that passes a partial config dict. Since this value is only needed when preload_cache is enabled, either (a) default it with config["data_set"].get("preload_threads", 50) and cast/validate it as a positive int, or (b) only read/validate it inside the if self._preload_cache and self._use_cache branch.

Suggested change
self._preload_threads = config["data_set"]["preload_threads"]
preload_threads = config["data_set"].get("preload_threads", 50)
try:
    preload_threads = int(preload_threads)
except (TypeError, ValueError) as exc:
    raise ValueError(
        "config['data_set']['preload_threads'] must be an integer, "
        f"got {preload_threads!r}"
    ) from exc
if preload_threads <= 0:
    raise ValueError(
        "config['data_set']['preload_threads'] must be a positive integer, "
        f"got {preload_threads!r}"
    )
self._preload_threads = preload_threads

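The validation the review suggests can be condensed into a standalone helper for clarity. `validate_preload_threads` is a name introduced here purely for illustration; the actual PR reads the value inline in `DataCache.__init__`:

```python
def validate_preload_threads(data_set_cfg, default=50):
    """Read and validate preload_threads from a data_set config mapping.

    Falls back to `default` when the key is absent (e.g. configs saved
    by older runs), and rejects non-integer or non-positive values.
    (Hypothetical helper for illustration, not Hyrax's actual code.)
    """
    value = data_set_cfg.get("preload_threads", default)
    try:
        value = int(value)
    except (TypeError, ValueError) as exc:
        raise ValueError(
            f"preload_threads must be an integer, got {value!r}"
        ) from exc
    if value <= 0:
        raise ValueError(
            f"preload_threads must be a positive integer, got {value}"
        )
    return value
```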
Comment on lines 250 to 254
## We use self._preload threads here as both the number of workers and
## the number of in-flight futures that this lazy map executor keeps running
##
## The goal here is actually maximum filesystem performance on a high-latency filesystem
## Currently the defaults are tuned for UW's HYAK Klone filesystem, where 50 threads is

Copilot AI Feb 4, 2026


The new block comment is internally inconsistent/incomplete: it refers to self._preload threads (missing _threads) and ends the sentence "where 50 threads is" without finishing the thought. This makes the guidance hard to follow; please correct the attribute name in the comment and complete or remove the unfinished sentence.

Suggested change
## We use self._preload threads here as both the number of workers and
## the number of in-flight futures that this lazy map executor keeps running
##
## The goal here is actually maximum filesystem performance on a high-latency filesystem
## Currently the defaults are tuned for UW's HYAK Klone filesystem, where 50 threads is
## We use self._preload_threads here as both the number of workers and
## the number of in-flight futures that this lazy map executor keeps running.
##
## The goal here is actually maximum filesystem performance on a high-latency filesystem.
## Currently the defaults are tuned for UW's HYAK Klone filesystem, where 50 threads is an
## empirically chosen near-optimal default for preloading performance on that system.

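The "lazy map executor" pattern these comments describe — one thread pool whose worker count also caps the number of in-flight futures — can be sketched roughly as below. `lazy_map` is a hypothetical standalone helper written for illustration, not Hyrax's implementation; note that it yields results as they complete, so output order is not input order:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def lazy_map(fn, items, max_in_flight=50):
    """Yield fn(item) for each item, keeping at most max_in_flight
    futures running so a slow consumer never queues unbounded work."""
    items = iter(items)
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        in_flight = set()
        # Prime the window of in-flight futures.
        for item in items:
            in_flight.add(pool.submit(fn, item))
            if len(in_flight) >= max_in_flight:
                break
        while in_flight:
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            for fut in done:
                yield fut.result()
                # Refill the window once per completed future.
                try:
                    in_flight.add(pool.submit(fn, next(items)))
                except StopIteration:
                    pass
```

On a high-latency filesystem, a wide window keeps many reads outstanding at once, which is why a value like 50 can outperform a thread count matched to CPU cores.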
mtauraso and others added 5 commits February 4, 2026 10:31
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@github-actions
Copy link

github-actions bot commented Feb 4, 2026

Before [f90813e] After [ef23f0e] Ratio Benchmark (Parameter)
failed failed n/a data_cache_benchmarks.DataCacheBenchmarks.time_preload_cache_cifar
failed failed n/a data_cache_benchmarks.DataCacheBenchmarks.track_cache_cifar_hyrax_size_undercount
196±1ms 199±2ms 1.02 benchmarks.time_import
37.1±0.3ms 38.0±0.3ms 1.02 benchmarks.time_nb_obj_dir
1.64G 1.67G 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'chromadb')
36.0±0.1ms 36.5±0.2ms 1.01 benchmarks.time_nb_obj_construct
1.90±0.01s 1.92±0.02s 1.01 benchmarks.time_prepare_help
1.91±0.02s 1.92±0.01s 1.01 benchmarks.time_rebuild_manifest_help
1.89±0.02s 1.90±0.02s 1.01 benchmarks.time_visualize_help
3.74G 3.77G 1.01 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'qdrant')


Collaborator

@drewoldag drewoldag left a comment


What I see here seems fine. I like that you can easily adjust the threads now. I am a little concerned that a general user reading the comment in the default config might not be able to get out of it what they need.

Comment on lines +289 to +290
# and your storage layer has high latency or is IOPS limited. Avoid writing your dataset class this way,
# but if you do then a large number of preload threads will maximize io throughput
Collaborator


I'm not sure that general users are going to understand what you mean with these last two lines. How will a user know if their storage layer has high latency or is IOPS-limited? What are they to avoid when writing their dataset class? What is a "large" number of threads?

Collaborator Author


Yeah, you're right. I'll rewrite this and try to give some specific guidance of "if pre-caching is slow try..."
