DM-54879: Add --ignore-existing-metadata-for for when upstream outputs are selectively retained#561
DM-54879: Add --ignore-existing-metadata-for for when upstream outputs are selectively retained#561hsinfang wants to merge 6 commits into
Conversation
d4f69bd to
4bf23e3
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #561 +/- ##
==========================================
+ Coverage 88.77% 88.83% +0.05%
==========================================
Files 159 159
Lines 22053 22143 +90
Branches 2623 2625 +2
==========================================
+ Hits 19578 19670 +92
+ Misses 1839 1835 -4
- Partials 636 638 +2 ☔ View full report in Codecov by Sentry. |
The skip_existing_in behavior of QuantumGraphBuilder was previously only covered through test_separable_pipeline_executor.py, where SeparablePipelineExecutor drives AllDimensionsQuantumGraphBuilder. No tests exercised the builder directly at the unit level.
For tasks listed in ignore_metadata_for, _skip_quantum_if_metadata_exists changes the completion signal from task metadata to all non-metadata outputs existing in skip_existing_in. A quantum is not skipped when any non-metadata output is absent; skipped when all are present. When prior outputs are visible, the quantum is still skipped and discard_output_in_the_way runs. When outputs are absent, the quantum is not skipped and they cannot be "in the way" in a fresh output_run.
The completion signal for tasks in `_ignore_metadata_for` previously
checked all non-metadata outputs, including log datasets ({task}_log).
This caused tasks whose logs were not retained to always not be skipped,
even when all science outputs were present.
Excluding log datasets from the output check means whether logs were
retained is irrelevant to whether science outputs need to be regenerated.
SeparablePipelineExecutor is not used by pipetask, but we might as well extend the same option and get tested there.
4bf23e3 to
d6ef7aa
Compare
| return zstandard.ZstdCompressionDict(b"") | ||
| self.comms.log.info("Training compression dictionary.") | ||
| training_inputs: list[bytes] = [] | ||
| training_inputs: list[bytes | bytearray | memoryview[int]] = [] |
There was a problem hiding this comment.
I'm curious where this is coming from; AFAIK we don't use bytearray or memoryview[int] for any of these.
| repodir = tempfile.TemporaryDirectory() | ||
| self.addCleanup(tempfile.TemporaryDirectory.cleanup, repodir) | ||
| pipeline = simpleQGraph.makeSimplePipeline(nQuanta=1) | ||
| butler, _ = simpleQGraph.makeSimpleQGraph(root=repodir.name, pipeline=pipeline, nQuanta=1) |
There was a problem hiding this comment.
I don't think this is worth changing at this point, but for new tests I'd generally recommend lsst.pipe.base.tests.mocks.InMemoryRepo or lsst.pipe.base.tests.mocks.DirectButlerRepo (see test_predicted_qg.py for an example). That also preps a butler repo with input datasets and helps you build a pipeline, and while it's just about as concise for simple cases, it's also extensible to more complex graphs.
| tasks a quantum is skipped only when all of its non-metadata outputs | ||
| are present in ``skip_existing_in``; it is not skipped when any | ||
| non-metadata output is absent. This is useful for pipelines | ||
| where some upstream tasks do not retain all of their outputs, so that |
There was a problem hiding this comment.
Reword to "where upstream processing does not retain all task outputs" (it's not the tasks that are retaining or not).
| when all of its science outputs are present in ``skip_existing_in``. | ||
| If any such output is absent the quantum is not skipped, so the task | ||
| can regenerate the missing outputs. This supports pipelines where | ||
| upstream tasks do not retain all of their outputs. |
There was a problem hiding this comment.
Reword as in the other thread.
| self.butler = butler.clone(collections=input_collections) | ||
| self.output_run = output_run | ||
| self.skip_existing_in = skip_existing_in | ||
| self.ignore_metadata_for: frozenset[str] = frozenset(ignore_metadata_for) |
There was a problem hiding this comment.
We probably want to raise here if ignore_metadata_for is not empty but skip_existing_in empty (since then ignore_metadata_for would have no effect).
Checklist
package-docs builddoc/changes