
Moving precache threads to config#673

Open
mtauraso wants to merge 6 commits into main from mtauraso/precache-config

Conversation

@mtauraso
Collaborator

@mtauraso mtauraso commented Feb 4, 2026

No description provided.

@mtauraso mtauraso requested a review from drewoldag February 4, 2026 18:25
@mtauraso mtauraso self-assigned this Feb 4, 2026
Copilot AI review requested due to automatic review settings February 4, 2026 18:25
@codecov

codecov bot commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.94%. Comparing base (f90813e) to head (a58b902).

Files with missing lines | Patch % | Lines
src/hyrax/data_sets/data_cache.py | 0.00% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #673      +/-   ##
==========================================
- Coverage   62.95%   62.94%   -0.02%     
==========================================
  Files          58       58              
  Lines        5782     5780       -2     
==========================================
- Hits         3640     3638       -2     
  Misses       2142     2142              

☔ View full report in Codecov by Sentry.

Contributor

Copilot AI left a comment


Pull request overview

This PR moves the DataCache preload thread count from a hard-coded value to a runtime configuration option.

Changes:

  • Add data_set.preload_threads (default 50) to hyrax_default_config.toml.
  • Update DataCache to use config["data_set"]["preload_threads"] for ThreadPoolExecutor(max_workers=...) and for the lazy executor’s in-flight futures limit.
  • Remove the previously hard-coded _determine_numprocs_preload() helper.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File | Description
src/hyrax/hyrax_default_config.toml | Introduces the preload_threads configuration knob for cache preloading.
src/hyrax/data_sets/data_cache.py | Switches preload concurrency to be driven by data_set.preload_threads instead of a hard-coded constant.
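As a sketch, the new knob in hyrax_default_config.toml presumably looks something like the following — the key name and default come from the review summary above, but the comment wording is illustrative, not copied from the diff:

```toml
[data_set]
# Number of threads used both as ThreadPoolExecutor workers and as the
# cap on in-flight futures when preloading the data cache. The default
# of 50 is tuned for UW's HYAK Klone filesystem.
# (Comment text here is illustrative, not the actual config comment.)
preload_threads = 50
```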

# Save config we need
self._use_cache = config["data_set"]["use_cache"]
self._preload_cache = config["data_set"]["preload_cache"]
self._preload_threads = config["data_set"]["preload_threads"]

Copilot AI Feb 4, 2026


preload_threads is read unconditionally from config["data_set"], which will raise a KeyError for configs created from older saved runs (e.g., InferenceDataSet loads original_dataset_config.toml via ConfigManager.read_runtime_config(...) without merging defaults) or any caller that passes a partial config dict. Since this value is only needed when preload_cache is enabled, either (a) default it with config["data_set"].get("preload_threads", 50) and cast/validate it as a positive int, or (b) only read/validate it inside the if self._preload_cache and self._use_cache branch.

Suggested change
self._preload_threads = config["data_set"]["preload_threads"]
preload_threads = config["data_set"].get("preload_threads", 50)
try:
    preload_threads = int(preload_threads)
except (TypeError, ValueError) as exc:
    raise ValueError(
        "config['data_set']['preload_threads'] must be an integer, "
        f"got {preload_threads!r}"
    ) from exc
if preload_threads <= 0:
    raise ValueError(
        "config['data_set']['preload_threads'] must be a positive integer, "
        f"got {preload_threads!r}"
    )
self._preload_threads = preload_threads

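The validation the review suggests can be condensed into a standalone helper for clarity. `validate_preload_threads` is a name introduced here purely for illustration; the actual PR reads the value inline in `DataCache.__init__`:

```python
def validate_preload_threads(data_set_cfg, default=50):
    """Read and validate preload_threads from a data_set config mapping.

    Falls back to `default` when the key is absent (e.g. configs saved
    by older runs), and rejects non-integer or non-positive values.
    (Hypothetical helper for illustration, not Hyrax's actual code.)
    """
    value = data_set_cfg.get("preload_threads", default)
    try:
        value = int(value)
    except (TypeError, ValueError) as exc:
        raise ValueError(
            f"preload_threads must be an integer, got {value!r}"
        ) from exc
    if value <= 0:
        raise ValueError(
            f"preload_threads must be a positive integer, got {value}"
        )
    return value
```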
Comment on lines 250 to 254
## We use self._preload threads here as both the number of workers and
## the number of in-flight futures that this lazy map executor keeps running
##
## The goal here is actually maximum filesystem performance on a high-latency filesystem
## Currently the defaults are tuned for UW's HYAK Klone filesystem, where 50 threads is

Copilot AI Feb 4, 2026


The new block comment is internally inconsistent/incomplete: it refers to self._preload threads (missing _threads) and ends the sentence "where 50 threads is" without finishing the thought. This makes the guidance hard to follow; please correct the attribute name in the comment and complete or remove the unfinished sentence.

Suggested change
## We use self._preload threads here as both the number of workers and
## the number of in-flight futures that this lazy map executor keeps running
##
## The goal here is actually maximum filesystem performance on a high-latency filesystem
## Currently the defaults are tuned for UW's HYAK Klone filesystem, where 50 threads is
## We use self._preload_threads here as both the number of workers and
## the number of in-flight futures that this lazy map executor keeps running.
##
## The goal here is actually maximum filesystem performance on a high-latency filesystem.
## Currently the defaults are tuned for UW's HYAK Klone filesystem, where 50 threads is an
## empirically chosen near-optimal default for preloading performance on that system.

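The "lazy map executor" pattern these comments describe — one thread pool whose worker count also caps the number of in-flight futures — can be sketched roughly as below. `lazy_map` is a hypothetical standalone helper written for illustration, not Hyrax's implementation; note that it yields results as they complete, so output order is not input order:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def lazy_map(fn, items, max_in_flight=50):
    """Yield fn(item) for each item, keeping at most max_in_flight
    futures running so a slow consumer never queues unbounded work."""
    items = iter(items)
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        in_flight = set()
        # Prime the window of in-flight futures.
        for item in items:
            in_flight.add(pool.submit(fn, item))
            if len(in_flight) >= max_in_flight:
                break
        while in_flight:
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            for fut in done:
                yield fut.result()
                # Refill the window once per completed future.
                try:
                    in_flight.add(pool.submit(fn, next(items)))
                except StopIteration:
                    pass
```

On a high-latency filesystem, a wide window keeps many reads outstanding at once, which is why a value like 50 can outperform a thread count matched to CPU cores.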
mtauraso and others added 5 commits February 4, 2026 10:31
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@github-actions
Copy link

github-actions bot commented Feb 4, 2026

Before [f90813e] After [ef23f0e] Ratio Benchmark (Parameter)
failed failed n/a data_cache_benchmarks.DataCacheBenchmarks.time_preload_cache_cifar
failed failed n/a data_cache_benchmarks.DataCacheBenchmarks.track_cache_cifar_hyrax_size_undercount
196±1ms 199±2ms 1.02 benchmarks.time_import
37.1±0.3ms 38.0±0.3ms 1.02 benchmarks.time_nb_obj_dir
1.64G 1.67G 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'chromadb')
36.0±0.1ms 36.5±0.2ms 1.01 benchmarks.time_nb_obj_construct
1.90±0.01s 1.92±0.02s 1.01 benchmarks.time_prepare_help
1.91±0.02s 1.92±0.01s 1.01 benchmarks.time_rebuild_manifest_help
1.89±0.02s 1.90±0.02s 1.01 benchmarks.time_visualize_help
3.74G 3.77G 1.01 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'qdrant')


Collaborator

@drewoldag drewoldag left a comment


What I see here seems fine. I like that you can easily adjust the threads now. I am a little concerned that a general user reading the comment in the default config might not be able to get out of it what they need.

Comment on lines +289 to +290
# and your storage layer has high latency or is IOPS limited. Avoid writing your dataset class this way,
# but if you do then a large number of preload threads will maximize io throughput
Collaborator


I'm not sure that general users are going to understand what you mean with these last two lines. How will a user know if their storage layer has high latency or is IOPS-limited? What are they to avoid when writing their dataset class? What is a "large" number of threads?

Collaborator Author


Yeah, you're right. I'll rewrite this and try to give some specific guidance of "if pre-caching is slow try..."
