Codecov Report

❌ Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #673      +/-   ##
==========================================
- Coverage   62.95%   62.94%   -0.02%
==========================================
  Files          58       58
  Lines        5782     5780       -2
==========================================
- Hits         3640     3638       -2
  Misses       2142     2142
```
Pull request overview
This PR moves the DataCache preload thread count from a hard-coded value to a runtime configuration option.
Changes:
- Add `data_set.preload_threads` (default 50) to `hyrax_default_config.toml`.
- Update `DataCache` to use `config["data_set"]["preload_threads"]` for `ThreadPoolExecutor(max_workers=...)` and for the lazy executor's in-flight futures limit.
- Remove the previously hard-coded `_determine_numprocs_preload()` helper.
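The wiring the changes describe might look roughly like this. This is an illustrative sketch, not the actual `DataCache` code: the config shape and the `preload_threads` key come from the PR, while `load_item` and the inline `config` dict are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical minimal config dict mirroring the relevant part of
# hyrax_default_config.toml after this PR.
config = {"data_set": {"use_cache": True, "preload_cache": True, "preload_threads": 50}}

# The thread count now comes from runtime configuration rather than a constant.
preload_threads = config["data_set"]["preload_threads"]

def load_item(i):
    # Stand-in for a real (possibly high-latency) filesystem read.
    return i * 2

with ThreadPoolExecutor(max_workers=preload_threads) as pool:
    results = list(pool.map(load_item, range(10)))

print(results)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```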
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/hyrax/hyrax_default_config.toml` | Introduces the `preload_threads` configuration knob for cache preloading. |
| `src/hyrax/data_sets/data_cache.py` | Switches preload concurrency to be driven by `data_set.preload_threads` instead of a hard-coded constant. |
src/hyrax/data_sets/data_cache.py
Outdated
```python
        # Save config we need
        self._use_cache = config["data_set"]["use_cache"]
        self._preload_cache = config["data_set"]["preload_cache"]
        self._preload_threads = config["data_set"]["preload_threads"]
```
preload_threads is read unconditionally from config["data_set"], which will raise a KeyError for configs created from older saved runs (e.g., InferenceDataSet loads original_dataset_config.toml via ConfigManager.read_runtime_config(...) without merging defaults) or any caller that passes a partial config dict. Since this value is only needed when preload_cache is enabled, either (a) default it with config["data_set"].get("preload_threads", 50) and cast/validate it as a positive int, or (b) only read/validate it inside the if self._preload_cache and self._use_cache branch.
```diff
-        self._preload_threads = config["data_set"]["preload_threads"]
+        preload_threads = config["data_set"].get("preload_threads", 50)
+        try:
+            preload_threads = int(preload_threads)
+        except (TypeError, ValueError) as exc:
+            raise ValueError(
+                "config['data_set']['preload_threads'] must be an integer, "
+                f"got {preload_threads!r}"
+            ) from exc
+        if preload_threads <= 0:
+            raise ValueError(
+                "config['data_set']['preload_threads'] must be a positive integer, "
+                f"got {preload_threads!r}"
+            )
+        self._preload_threads = preload_threads
```
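The failure mode the comment describes can be demonstrated in a few lines. The sketch below is illustrative: `old_config` is a hypothetical partial config dict standing in for one loaded from an older saved run without merged defaults.

```python
# A partial config from an old run: no "preload_threads" key was ever saved.
old_config = {"data_set": {"use_cache": True, "preload_cache": False}}

# The unconditional lookup in the original patch raises KeyError here.
try:
    threads = old_config["data_set"]["preload_threads"]
except KeyError:
    threads = None
print(threads)  # None: the hard lookup failed

# The suggested .get() form degrades gracefully to the default of 50.
threads = old_config["data_set"].get("preload_threads", 50)
print(threads)  # 50
```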
src/hyrax/data_sets/data_cache.py
Outdated
```python
        ## We use self._preload threads here as both the number of workers and
        ## the number of in-flight futures that this lazy map executor keeps running
        ##
        ## The goal here is actually maximum filesystem performance on a high-latency filesystem
        ## Currently the defaults are tuned for UW's HYAK Klone filesystem, where 50 threads is
```
The new block comment is internally inconsistent/incomplete: it refers to self._preload threads (missing _threads) and ends the sentence "where 50 threads is" without finishing the thought. This makes the guidance hard to follow; please correct the attribute name in the comment and complete or remove the unfinished sentence.
```diff
-        ## We use self._preload threads here as both the number of workers and
-        ## the number of in-flight futures that this lazy map executor keeps running
-        ##
-        ## The goal here is actually maximum filesystem performance on a high-latency filesystem
-        ## Currently the defaults are tuned for UW's HYAK Klone filesystem, where 50 threads is
+        ## We use self._preload_threads here as both the number of workers and
+        ## the number of in-flight futures that this lazy map executor keeps running.
+        ##
+        ## The goal here is actually maximum filesystem performance on a high-latency filesystem.
+        ## Currently the defaults are tuned for UW's HYAK Klone filesystem, where 50 threads is an
+        ## empirically chosen near-optimal default for preloading performance on that system.
```
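The "lazy map executor" pattern the comment describes, where the same number bounds both the worker threads and the in-flight futures, can be sketched as below. This is an illustration of the pattern only, not the actual `DataCache` implementation; `lazy_map` and its signature are hypothetical.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def lazy_map(fn, items, max_in_flight=50):
    """Yield fn(item) in order, keeping at most max_in_flight futures submitted."""
    items = iter(items)
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        window = deque()
        # Prime the window with up to max_in_flight futures.
        for item in items:
            window.append(pool.submit(fn, item))
            if len(window) >= max_in_flight:
                break
        while window:
            # Yield the oldest result, then top the window back up by one.
            yield window.popleft().result()
            for item in items:
                window.append(pool.submit(fn, item))
                break

results = list(lazy_map(lambda x: x + 1, range(5), max_in_flight=2))
print(results)  # [1, 2, 3, 4, 5]
```

Bounding the in-flight futures keeps many filesystem reads outstanding at once (which helps on a high-latency filesystem) without eagerly submitting the whole dataset and holding every result in memory.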
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
drewoldag
left a comment
What I see here seems fine. I like that you can easily adjust the threads now. I am a little concerned that a general user reading the comment in the default config might not be able to get out of it what they need.
```toml
# and your storage layer has high latency or is IOPS limited. Avoid writing your dataset class this way,
# but if you do then a large number of preload threads will maximize io throughput
```
I'm not sure that general users are going to understand what you mean with these last two lines. How will a user know if their storage layer has high latency or is IOPS limited? What are they to avoid when writing their dataset class? What is a "large" number of threads?
Yeah, you're right. I'll rewrite this and try to give some specific guidance along the lines of "if pre-caching is slow, try..."
No description provided.