Skip to content

DM-54879: Add --ignore-existing-metadata-for for when upstream outputs are selectively retained#561

Open
hsinfang wants to merge 6 commits into
mainfrom
tickets/DM-54879
Open

DM-54879: Add --ignore-existing-metadata-for for when upstream outputs are selectively retained#561
hsinfang wants to merge 6 commits into
mainfrom
tickets/DM-54879

Conversation

@hsinfang
Copy link
Copy Markdown
Contributor

@hsinfang hsinfang commented May 12, 2026

Checklist

  • ran Jenkins
  • ran and inspected package-docs build
  • added a release note for user-visible changes to doc/changes

@hsinfang hsinfang force-pushed the tickets/DM-54879 branch from d4f69bd to 4bf23e3 Compare May 12, 2026 22:31
@codecov
Copy link
Copy Markdown

codecov Bot commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.83%. Comparing base (6cec3c9) to head (2e9ac52).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #561      +/-   ##
==========================================
+ Coverage   88.77%   88.83%   +0.05%     
==========================================
  Files         159      159              
  Lines       22053    22143      +90     
  Branches     2623     2625       +2     
==========================================
+ Hits        19578    19670      +92     
+ Misses       1839     1835       -4     
- Partials      636      638       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

hsinfang added 5 commits May 12, 2026 15:45
The skip_existing_in behavior of QuantumGraphBuilder was previously
only covered through test_separable_pipeline_executor.py, where
SeparablePipelineExecutor drives AllDimensionsQuantumGraphBuilder.
No tests exercised the builder directly at the unit level.
For tasks listed in ignore_metadata_for, _skip_quantum_if_metadata_exists
changes the completion signal from task metadata to all non-metadata outputs
existing in skip_existing_in. A quantum is not skipped when any non-metadata
output is absent; skipped when all are present.

When prior outputs are visible, the quantum is still skipped and
discard_output_in_the_way runs.  When outputs are absent, the quantum is not
skipped and they cannot be "in the way" in a fresh output_run.
The completion signal for tasks in `_ignore_metadata_for` previously
checked all non-metadata outputs, including log datasets ({task}_log).
This caused tasks whose logs were not retained to always not be skipped,
even when all science outputs were present.

Excluding log datasets from the output check means whether logs were
retained is irrelevant to whether science outputs need to be regenerated.
SeparablePipelineExecutor is not used by pipetask, but we might
as well extend the same option and get tested there.
@hsinfang hsinfang force-pushed the tickets/DM-54879 branch from 4bf23e3 to d6ef7aa Compare May 12, 2026 22:45
return zstandard.ZstdCompressionDict(b"")
self.comms.log.info("Training compression dictionary.")
training_inputs: list[bytes] = []
training_inputs: list[bytes | bytearray | memoryview[int]] = []
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious where this is coming from; AFAIK we don't use bytearray or memoryview[int] for any of these.

repodir = tempfile.TemporaryDirectory()
self.addCleanup(tempfile.TemporaryDirectory.cleanup, repodir)
pipeline = simpleQGraph.makeSimplePipeline(nQuanta=1)
butler, _ = simpleQGraph.makeSimpleQGraph(root=repodir.name, pipeline=pipeline, nQuanta=1)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is worth changing at this point, but for new tests I'd generally recommend lsst.pipe.base.tests.mocks.InMemoryRepo or lsst.pipe.base.tests.mocks.DirectButlerRepo (see test_predicted_qg.py for an example). That also preps a butler repo with input datasets and helps you build a pipeline, and while it's just about as concise for simple cases, it's also extensible to more complex graphs.

tasks a quantum is skipped only when all of its non-metadata outputs
are present in ``skip_existing_in``; it is not skipped when any
non-metadata output is absent. This is useful for pipelines
where some upstream tasks do not retain all of their outputs, so that
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reword to "where upstream processing does not retain all task outputs" (it's not the tasks that are retaining or not).

when all of its science outputs are present in ``skip_existing_in``.
If any such output is absent the quantum is not skipped, so the task
can regenerate the missing outputs. This supports pipelines where
upstream tasks do not retain all of their outputs.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reword as in the other thread.

self.butler = butler.clone(collections=input_collections)
self.output_run = output_run
self.skip_existing_in = skip_existing_in
self.ignore_metadata_for: frozenset[str] = frozenset(ignore_metadata_for)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to raise here if ignore_metadata_for is not empty but skip_existing_in empty (since then ignore_metadata_for would have no effect).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants