Reconcile df control-plane with the durable engine #283

Merged: pinodeca merged 1 commit into main from tjgreen42/duroxide-reaper on Jul 2, 2026

tjgreen42 commented Jul 1, 2026

Contributor

Problem

df.start() writes its df control-plane rows in the caller's transaction but hands the workflow to the durable engine over a separate connection. Two gaps followed from that split:

  • Orphaned engine state. If the caller's transaction rolls back, the df rows vanish while the engine keeps an inert record that can never load its (rolled-back) graph and ends up Failed. Nothing reclaimed it.
  • Engine state outliving df state. The routine that deleted old terminal instances only removed df rows; the matching engine records were never retired, so engine state accumulated without bound.
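The first gap can be reproduced in a toy model. This is only a sketch: ToyStore, start(), and the "pending"/"inert" values are hypothetical stand-ins for the df tables, df.start(), and the engine's record, not the extension's actual code. It assumes only what the description states: df rows follow the caller's transaction, while the engine hand-off is durable on its own connection.

```python
class ToyStore:
    """Hypothetical stand-in for one side of the split: committed rows by id."""

    def __init__(self):
        self.rows = {}


def start(instance_id, df_tables, engine, commit):
    """Toy df.start(): df rows are transactional, the engine hand-off is not."""
    # df rows are staged inside the caller's transaction.
    staged = {instance_id: "pending"}
    # The engine hand-off uses a separate connection, so it is durable
    # immediately, whatever happens to the caller's transaction.
    engine.rows[instance_id] = "inert"
    if commit:
        df_tables.rows.update(staged)
    # On rollback the staged df rows are simply discarded.


df_tables, engine = ToyStore(), ToyStore()
start("a", df_tables, engine, commit=True)   # normal start
start("b", df_tables, engine, commit=False)  # caller rolls back

# Engine records with no matching df row are the orphans described above.
orphans = sorted(set(engine.rows) - set(df_tables.rows))
print(orphans)  # ['b']
```

Instance "b" has an engine record but no df row, which is exactly the state that nothing previously reclaimed.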

Change

The background worker already ran a periodic pass — roughly once an hour — that removed old terminal (completed/failed/cancelled) rows from df.instances and df.nodes so those control-plane tables don't grow without bound. It only ever touched the df tables, never the engine.

This turns that pass into a reconciliation of the df control-plane and the engine. Each pass:

  1. Removes expired terminal instances — their df.instances/df.nodes rows and their engine records.
  2. Reclaims orphaned engine records — the inert, df-less record a rolled-back df.start() leaves behind — once it ages past the retention window.

Anything the engine is still tracking with a live df row is left untouched.
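The two steps of a pass can be sketched over in-memory data. The types and field names here (DfInstance, EngineRecord, finished_at, created_at) are assumptions made for illustration, as is the choice to treat an age exactly equal to the retention window as expired. The unit tests for this PR mention that select_orphans also excludes sub-orchestration ids; that filter is omitted because the description does not say how those ids are recognized.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

TERMINAL = {"completed", "failed", "cancelled"}


@dataclass
class DfInstance:  # hypothetical shape of a df.instances row
    id: str
    status: str
    finished_at: Optional[datetime]  # None while the instance is still live


@dataclass
class EngineRecord:  # hypothetical shape of an engine record
    id: str
    created_at: datetime


def reconcile(df, engine, now, retention):
    """Return (expired_ids, orphan_ids) for one pass; inputs are not mutated."""
    cutoff = now - retention
    # Step 1: terminal instances older than the retention window.
    expired = {
        i.id
        for i in df
        if i.status in TERMINAL
        and i.finished_at is not None
        and i.finished_at <= cutoff
    }
    # Step 2: engine records with no df row at all, once they are old enough.
    df_ids = {i.id for i in df}
    orphans = {
        r.id for r in engine if r.id not in df_ids and r.created_at <= cutoff
    }
    return expired, orphans


now = datetime(2026, 7, 2)
days = lambda n: timedelta(days=n)
df = [
    DfInstance("old-done", "completed", now - days(40)),
    DfInstance("new-done", "failed", now - days(1)),
    DfInstance("live", "running", None),
]
engine = [
    EngineRecord("old-done", now - days(41)),
    EngineRecord("new-done", now - days(2)),
    EngineRecord("live", now - days(50)),
    EngineRecord("orphan-old", now - days(45)),
    EngineRecord("orphan-new", now - days(1)),
]

expired, orphans = reconcile(df, engine, now, retention=days(30))
print(sorted(expired))  # ['old-done']
print(sorted(orphans))  # ['orphan-old']
```

In this sketch "live" is 50 days old on the engine side but is never selected, because it still has a df row; "orphan-new" is skipped until it ages past the window.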

Configuration

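Two Postmaster-context GUCs govern reconciliation, replacing the previously hardcoded constants. Because they are Postmaster-context, a change to either one takes effect only after a server restart: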

| GUC | Default | Meaning |
| --- | --- | --- |
| pg_durable.retention_days | 30 | Days a terminal instance and its engine record are retained; also the age bound for reclaiming orphaned engine records. 0 removes terminal instances as soon as the next pass runs. |
| pg_durable.reconcile_interval | 3600 | Seconds between reconciliation passes. 0 disables reconciliation. |

A fixed hard cap of 10,000 retained terminal instances still applies. Both settings are documented in USER_GUIDE.md.
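How the two settings and the hard cap might interact can be sketched as follows. The function names, the use of plain day numbers for timestamps, and above all the eviction policy (when more than the cap remain, drop the oldest first) are assumptions; the description says only that a cap of 10,000 retained terminal instances applies, not how it is enforced.

```python
HARD_CAP = 10_000  # fixed cap on retained terminal instances, per the description


def select_expired(terminal, now_day, retention_days, cap=HARD_CAP):
    """terminal: list of (id, finished_day) pairs. Returns ids to remove.

    Assumed policy: remove everything at or past the retention age, then,
    if more than `cap` instances would still be retained, remove the oldest
    of those until only `cap` remain.
    """
    cutoff = now_day - retention_days
    by_age = sorted(terminal, key=lambda t: t[1])  # oldest first
    expired = [i for i, day in by_age if day <= cutoff]
    kept = [(i, day) for i, day in by_age if day > cutoff]
    overflow = len(kept) - cap
    if overflow > 0:
        expired += [i for i, _ in kept[:overflow]]
    return expired


def pass_due(seconds_since_last, reconcile_interval):
    """A reconcile_interval of 0 disables reconciliation entirely."""
    if reconcile_interval == 0:
        return False
    return seconds_since_last >= reconcile_interval


sample = [("a", 0), ("b", 50), ("c", 90), ("d", 95)]
print(select_expired(sample, now_day=100, retention_days=30))         # ['a', 'b']
print(select_expired(sample, now_day=100, retention_days=30, cap=1))  # ['a', 'b', 'c']
print(select_expired(sample, now_day=100, retention_days=0))          # ['a', 'b', 'c', 'd']
print(pass_due(7200, reconcile_interval=0))                           # False
```

The third call mirrors the documented behavior of retention_days = 0: every terminal instance is selected on the next pass.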

Testing

  • Unit: expired-instance selection respects age and the hard cap; select_orphans excludes df-backed and sub-orchestration ids.
  • E2E (54_reconcile_orphans): a rolled-back df.start() orphan is reclaimed while a live df-backed instance is left intact.

tjgreen42 force-pushed the tjgreen42/duroxide-reaper branch from 39eb10c to 8eb4e86 on July 1, 2026 20:25
tjgreen42 changed the title from "Add a background reaper for orphaned durable-engine state" to "Reclaim orphaned durable-engine state via background cleanup" on Jul 1, 2026
tjgreen42 force-pushed the tjgreen42/duroxide-reaper branch 5 times, most recently from f8527ca to e29e530, on July 2, 2026 01:12
df.start() writes its df rows in the caller's transaction but hands the
workflow to the engine over a separate connection, so a rolled-back
df.start() leaves the engine holding an inert record with no df row.
Nothing reclaimed that engine state, and the engine records for removed
terminal instances were never retired alongside their df rows.

Turn the existing periodic pass that removes expired terminal instances
into a reconciliation of the df control-plane and the engine. Each pass
removes expired terminal instances (their df rows and engine records) and
reclaims df-less orphaned engine records older than the retention window.

Two Postmaster GUCs govern reconciliation:
- pg_durable.retention_days (default 30) — how long terminal instances and
  their engine records are kept; also the age bound for reclaiming orphans.
- pg_durable.reconcile_interval (default 3600s, 0 disables) — pass cadence.

Covered by expired-instance and select_orphans unit tests and the
54_reconcile_orphans E2E test, which terminates its signal-wait survivor so
no Running orchestration lingers in the shared data directory. The upgrade
harness strips the reconcile phase's reconcile_interval/retention_days from
postgresql.conf so neither leaks into the upgrade run.
tjgreen42 force-pushed the tjgreen42/duroxide-reaper branch from e29e530 to 65b69df on July 2, 2026 02:37
tjgreen42 marked this pull request as ready for review on July 2, 2026 02:46
tjgreen42 requested review from affandar and pinodeca on July 2, 2026 02:46
tjgreen42 changed the title from "Reclaim orphaned durable-engine state via background cleanup" to "Reconcile df control-plane with the durable engine" on Jul 2, 2026

pinodeca left a comment

Contributor

LGTM

pinodeca merged commit ae300be into main on Jul 2, 2026 (5 checks passed)
pinodeca deleted the tjgreen42/duroxide-reaper branch on July 2, 2026 17:48
pinodeca mentioned this pull request on Jul 2, 2026 (13 tasks)