DM-54879: Add --ignore-existing-metadata-for for when upstream outputs are selectively retained by hsinfang · Pull Request #561 · lsst/pipe_base

hsinfang · 2026-05-12T22:26:36Z

Checklist

ran Jenkins
ran and inspected package-docs build
added a release note for user-visible changes to doc/changes

codecov · 2026-05-12T22:34:55Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.83%. Comparing base (6cec3c9) to head (2e9ac52).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #561      +/-   ##
==========================================
+ Coverage   88.77%   88.83%   +0.05%     
==========================================
  Files         159      159              
  Lines       22053    22143      +90     
  Branches     2623     2625       +2     
==========================================
+ Hits        19578    19670      +92     
+ Misses       1839     1835       -4     
- Partials      636      638       +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

The skip_existing_in behavior of QuantumGraphBuilder was previously only covered through test_separable_pipeline_executor.py, where SeparablePipelineExecutor drives AllDimensionsQuantumGraphBuilder. No tests exercised the builder directly at the unit level.

For tasks listed in ignore_metadata_for, _skip_quantum_if_metadata_exists changes the completion signal from task metadata to all non-metadata outputs existing in skip_existing_in. A quantum is not skipped when any non-metadata output is absent; skipped when all are present. When prior outputs are visible, the quantum is still skipped and discard_output_in_the_way runs. When outputs are absent, the quantum is not skipped and they cannot be "in the way" in a fresh output_run.

The completion signal for tasks in `_ignore_metadata_for` previously checked all non-metadata outputs, including log datasets ({task}_log). This caused tasks whose logs were not retained to always not be skipped, even when all science outputs were present. Excluding log datasets from the output check means whether logs were retained is irrelevant to whether science outputs need to be regenerated.

SeparablePipelineExecutor is not used by pipetask, but we might as well extend the same option and get tested there.

TallJimbo · 2026-05-21T22:46:03Z

            return zstandard.ZstdCompressionDict(b"")
        self.comms.log.info("Training compression dictionary.")
-        training_inputs: list[bytes] = []
+        training_inputs: list[bytes | bytearray | memoryview[int]] = []


I'm curious where this is coming from; AFAIK we don't use bytearray or memoryview[int] for any of these.

TallJimbo · 2026-05-21T22:57:48Z

+        repodir = tempfile.TemporaryDirectory()
+        self.addCleanup(tempfile.TemporaryDirectory.cleanup, repodir)
+        pipeline = simpleQGraph.makeSimplePipeline(nQuanta=1)
+        butler, _ = simpleQGraph.makeSimpleQGraph(root=repodir.name, pipeline=pipeline, nQuanta=1)


I don't think this is worth changing at this point, but for new tests I'd generally recommend lsst.pipe.base.tests.mocks.InMemoryRepo or lsst.pipe.base.tests.mocks.DirectButlerRepo (see test_predicted_qg.py for an example). That also preps a butler repo with input datasets and helps you build a pipeline, and while it's just about as concise for simple cases, it's also extensible to more complex graphs.

TallJimbo · 2026-05-22T01:07:50Z

+        tasks a quantum is skipped only when all of its non-metadata outputs
+        are present in ``skip_existing_in``; it is not skipped when any
+        non-metadata output is absent.  This is useful for pipelines
+        where some upstream tasks do not retain all of their outputs, so that


Reword to "where upstream processing does not retain all task outputs" (it's not the tasks that are retaining or not).

TallJimbo · 2026-05-22T01:11:32Z

+        when all of its science outputs are present in ``skip_existing_in``.
+        If any such output is absent the quantum is not skipped, so the task
+        can regenerate the missing outputs.  This supports pipelines where
+        upstream tasks do not retain all of their outputs.


Reword as in the other thread.

TallJimbo · 2026-05-22T01:42:46Z

        self.butler = butler.clone(collections=input_collections)
        self.output_run = output_run
        self.skip_existing_in = skip_existing_in
+        self.ignore_metadata_for: frozenset[str] = frozenset(ignore_metadata_for)


We probably want to raise here if ignore_metadata_for is not empty but skip_existing_in empty (since then ignore_metadata_for would have no effect).

hsinfang force-pushed the tickets/DM-54879 branch from d4f69bd to 4bf23e3 Compare May 12, 2026 22:31

hsinfang added 5 commits May 12, 2026 15:45

Fix mypy type error in _writer.py

0db29d2

Add ignore_metadata_for to SeparablePipelineExecutor

d6ef7aa

SeparablePipelineExecutor is not used by pipetask, but we might as well extend the same option and get tested there.

hsinfang force-pushed the tickets/DM-54879 branch from 4bf23e3 to d6ef7aa Compare May 12, 2026 22:45

Add a note in doc/changes

2e9ac52

TallJimbo approved these changes May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DM-54879: Add --ignore-existing-metadata-for for when upstream outputs are selectively retained#561

DM-54879: Add --ignore-existing-metadata-for for when upstream outputs are selectively retained#561
hsinfang wants to merge 6 commits into
mainfrom
tickets/DM-54879

hsinfang commented May 12, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 12, 2026 •

edited

Loading

Uh oh!

TallJimbo May 21, 2026

Uh oh!

TallJimbo May 21, 2026

Uh oh!

TallJimbo May 22, 2026

Uh oh!

TallJimbo May 22, 2026

Uh oh!

TallJimbo May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hsinfang commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Checklist

Uh oh!

codecov Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

TallJimbo May 21, 2026

Choose a reason for hiding this comment

Uh oh!

TallJimbo May 21, 2026

Choose a reason for hiding this comment

Uh oh!

TallJimbo May 22, 2026

Choose a reason for hiding this comment

Uh oh!

TallJimbo May 22, 2026

Choose a reason for hiding this comment

Uh oh!

TallJimbo May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hsinfang commented May 12, 2026 •

edited

Loading

codecov Bot commented May 12, 2026 •

edited

Loading