Reconcile df control-plane with the durable engine #283
Merged
This was referenced Jul 1, 2026
39eb10c to 8eb4e86
f8527ca to e29e530
df.start() writes its df rows in the caller's transaction but hands the workflow to the engine over a separate connection, so a rolled-back df.start() leaves the engine holding an inert record with no df row. Nothing reclaimed that engine state, and the engine records for removed terminal instances were never retired alongside their df rows.

Turn the existing periodic pass that removes expired terminal instances into a reconciliation of the df control-plane and the engine. Each pass removes expired terminal instances (their df rows and engine records) and reclaims df-less orphaned engine records older than the retention window.

Two Postmaster GUCs govern reconciliation:
- pg_durable.retention_days (default 30): how long terminal instances and their engine records are kept; also the age bound for reclaiming orphans.
- pg_durable.reconcile_interval (default 3600s, 0 disables): pass cadence.

Covered by expired-instance and select_orphans unit tests and the 54_reconcile_orphans E2E test, which terminates its signal-wait survivor so no Running orchestration lingers in the shared data directory. The upgrade harness strips the reconcile phase's reconcile_interval/retention_days from postgresql.conf so neither leaks into the upgrade run.
e29e530 to 65b69df
Problem
df.start() writes its df control-plane rows in the caller's transaction but hands the workflow to the durable engine over a separate connection. Two gaps followed from that split:
- When df.start() is rolled back, its df rows vanish while the engine keeps an inert record that can never load its (rolled-back) graph and ends up Failed. Nothing reclaimed it.
- Engine records outlived their df state. The routine that deleted old terminal instances only removed df rows; the matching engine records were never retired, so engine state accumulated without bound.

Change
The background worker already ran a periodic pass, roughly once an hour, that removed old terminal (completed/failed/cancelled) rows from df.instances and df.nodes so those control-plane tables don't grow without bound. It only ever touched the df tables, never the engine.

This turns that pass into a reconciliation of the df control-plane and the engine. Each pass:
- Removes expired terminal instances: their df.instances/df.nodes rows and their engine records.
- Reclaims orphaned engine state, the inert record a rolled-back df.start() leaves behind, once it ages past the retention window.

Anything the engine is still tracking with a live df row is left untouched.

Configuration
Two Postmaster-context GUCs govern reconciliation, replacing the hardcoded constants:
- pg_durable.retention_days (default 30): 0 removes terminal instances as soon as the next pass runs.
- pg_durable.reconcile_interval (default 3600, in seconds): 0 disables reconciliation.

A fixed hard cap of 10,000 retained terminal instances still applies. Both settings are documented in USER_GUIDE.md.

Testing
- Unit tests: select_orphans excludes df-backed and sub-orchestration ids.
- E2E (54_reconcile_orphans): a rolled-back df.start() orphan is reclaimed while a live df-backed instance is left intact.
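
For readers who want the orphan rule in concrete terms, here is a minimal Python sketch of the selection that the select_orphans unit test exercises. It is not the PR's code: the `EngineRecord` type, its field names, the `select_orphans_sketch` function, and the use of a `parent_id` to mark sub-orchestrations are all assumptions made for illustration. Only the three conditions come from the description above: skip ids backed by a df row, skip sub-orchestrations, and reclaim only records older than the retention window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class EngineRecord:
    """Hypothetical view of one engine-side record (field names are assumptions)."""
    instance_id: str
    created_at: datetime
    parent_id: Optional[str] = None  # assumed marker for a sub-orchestration


def select_orphans_sketch(
    engine_records: Iterable[EngineRecord],
    df_instance_ids: Set[str],
    now: datetime,
    retention_days: int = 30,
) -> List[str]:
    """Return ids of engine records that have no df row and have aged out."""
    cutoff = now - timedelta(days=retention_days)
    orphans = []
    for rec in engine_records:
        if rec.instance_id in df_instance_ids:
            continue  # still backed by a df row: leave untouched
        if rec.parent_id is not None:
            continue  # sub-orchestration: never has its own df row (assumed)
        if rec.created_at >= cutoff:
            continue  # inside the retention window: too young to reclaim
        orphans.append(rec.instance_id)
    return orphans


now = datetime(2026, 7, 1, tzinfo=timezone.utc)
records = [
    EngineRecord("live", now - timedelta(days=40)),
    EngineRecord("orphan-old", now - timedelta(days=40)),
    EngineRecord("orphan-new", now - timedelta(days=1)),
    EngineRecord("child", now - timedelta(days=40), parent_id="live"),
]
print(select_orphans_sketch(records, {"live"}, now))  # ['orphan-old']
```

In this sketch the age check is what keeps a pass from racing an in-flight df.start(): a record whose df row has not been committed yet looks df-less, but it is far younger than the retention window and is skipped.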