Skip to content

Reconcile df.start() with the durable engine via committed intent#281

Closed
tjgreen42 wants to merge 2 commits into
mainfrom
tjgreen42/reconcile-start
Closed

Reconcile df.start() with the durable engine via committed intent#281
tjgreen42 wants to merge 2 commits into
mainfrom
tjgreen42/reconcile-start

Conversation

@tjgreen42

@tjgreen42 tjgreen42 commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Problem

df.start() writes its df rows in the caller's transaction but starts the workflow engine out-of-band on a separate connection. Those two writes can diverge:

  • Ghost — a rolled-back df.start() leaves the engine running a workflow with no df record behind it.
  • Stuck — a committed df instance whose engine start was lost (worker down, dropped connection) never runs.

Root cause: the df write and the engine start happen in two different transactions with nothing reconciling them.

Approach

Treat the committed df row as the source of truth (intent) and let the background worker converge the engine to match, through duroxide's Rust API (Client::start_orchestration / get_orchestration_status).

  • df.start() persists a start_input payload on df.instances and issues a transactional NOTIFY; it no longer starts the engine itself.
  • The worker LISTENs for instant starts and periodically sweeps for committed-but-unstarted instances the engine does not yet know about, starting each exactly once via an atomic claim.

The df write and the engine start stay in separate transactions and are reconciled asynchronously: a rolled-back df.start() starts nothing (no ghost), and a committed instance is always eventually started (no stuck).

Exactly-once, crash recovery, and trust boundary

  • Exactly-once: the instant and sweep paths both claim the row with an atomic UPDATE ... RETURNING; only the winner starts the engine.
  • Crash recovery: the claim commits before the engine start, so a worker crash in between would strand the row in running. The sweep therefore also recovers stale claims — a running row the engine reports NotFound for whose claim is older than the grace is re-claimed and re-started.
  • Security: start_input is caller-writable and the orchestration selects the graph/role from the input's instance id. The worker forces the engine input's instance id to the claimed row id, and execute() uses the durable runtime's ctx.instance_id() instead of the payload's — so a crafted start_input cannot redirect the superuser worker to run another user's instance.
  • Resilience: the start listener is re-established if it drops (so reconcile_interval=0 still keeps the instant path); the sweep is batch-bounded and yields to shutdown; a startup catch-up sweep recovers starts whose NOTIFY was missed during downtime.

Changes

  • src/dsl.rsdf.start() intent path (persist start_input, transactional pg_notify); legacy inline start retained for pre-0.2.4 schemas (version-gated).
  • src/worker.rs — duroxide Client, LISTEN + reconcile sweep with stale-claim recovery, exactly-once claim, listener rebuild, bounded batches.
  • src/orchestrations/execute_function_graph.rs — trust ctx.instance_id() over the caller-supplied payload.
  • Schema (0.2.4)df.instances.start_input JSONB (fresh install + upgrade script), grant/revoke + backfill for existing df roles. GUCs pg_durable.reconcile_interval (15s) / reconcile_grace (300s).
  • Docs/testsdocs/spec-reconcile-start.md, USER_GUIDE section, upgrade-testing notes; tests/e2e/sql/25_reconcile_start.sql (ghost / stuck / stale-claim, RED on main), tests/e2e/sql/26_reconcile_security.sql (cross-tenant redirect blocked), upgrade grant-backfill coverage.

Backward compatibility

New .so runs against un-upgraded 0.2.2/0.2.3 schemas: df.start() falls back to the inline start until the column exists, and the worker re-probes so an in-place ALTER EXTENSION UPDATE enables reconciliation without a restart.

Scope

Start-convergence only. The engine-orphan cleanup direction (engine has it, df does not) is out of scope; df.cancel()/df.signal() stay on the existing path, with a targeted re-cancel if an instance is cancelled between the claim and the start.

Testing

build · clippy (nightly-equivalent) · fmt · unit 194 · E2E 38 · upgrade 37 — all green locally and in CI.

tjgreen42 added 2 commits July 1, 2026 03:51
df.start() previously wrote its df rows in the caller's transaction but
started the workflow engine out-of-band on a separate connection. Those
two writes could diverge:

- Ghost: a rolled-back df.start() left the engine running a workflow with
  no df record behind it.
- Stuck: a committed df instance whose engine start was lost never ran.

Record the start as committed intent and let the background worker
converge the engine to match, going only through duroxide's Rust API (no
shared transaction, no touching duroxide's internal schema):

- df.start() persists a start_input payload on df.instances and issues a
  transactional NOTIFY; it no longer starts the engine itself.
- The worker LISTENs for instant starts and periodically sweeps for
  committed-but-unstarted instances the engine does not yet know about,
  starting each exactly once via an atomic pending->running claim.

A rolled-back df.start() therefore starts nothing, and a committed
instance is always eventually started.

Schema (0.2.4): add df.instances.start_input JSONB (fresh install +
upgrade script), extend the grant/revoke INSERT column list, and backfill
the column grant to existing df-usage roles. Two Postmaster GUCs control
the sweep: pg_durable.reconcile_interval (15s) and reconcile_grace (300s).
Backward compatible: on a pre-0.2.4 schema df.start() keeps the inline
start, and the worker re-probes for the column so an in-place ALTER
EXTENSION UPDATE enables reconciliation without a restart.

Tests: tests/e2e/sql/25_reconcile_start.sql exercises both failure modes
(RED on main); upgrade test adds grant-backfill coverage. Docs: USER_GUIDE
section, docs/spec-reconcile-start.md, upgrade-testing notes.
Address findings from the PR review:

- Security (High): the worker replayed the caller-writable start_input
  verbatim, and the orchestration trusted the instance id embedded in it to
  load the graph (on the superuser pool, bypassing RLS). A low-privilege user
  could INSERT a crafted pending row whose start_input pointed at another
  user's instance and have the worker run the victim's workflow as the victim.
  The worker now forces the engine input's instance id to the claimed row id,
  and execute() uses the durable runtime's instance id (ctx.instance_id())
  instead of the payload's. Added tests/e2e/sql/26_reconcile_security.sql.

- Crash window (High): the claim (pending -> running) commits before the
  engine start, so a worker crash in between stranded the row in running
  forever (the pending-only sweep could never recover it). The sweep now also
  reconciles stale claims: a running row the engine reports NotFound for whose
  claim is older than the grace is re-claimed and re-started. Added a
  stale-running recovery scenario to 25_reconcile_start.sql.

- Listener recovery (Medium): the start listener was created once and dropped
  permanently on any recv error, so with reconcile_interval=0 a single error
  silently stopped all starts. The backstop tick now re-establishes the
  listener when it drops, honoring the "interval=0 keeps the instant path"
  contract.

- Bounded sweep (Low): the reconcile sweep now LIMITs its batch and checks for
  shutdown between iterations so a large backlog can't starve shutdown/drop
  detection.

- Cancel race: if an instance is cancelled between the claim and the start, the
  worker re-issues the cancel to the engine so the workflow does not run while
  df reads cancelled.

- Restart latency: a startup catch-up sweep (zero grace) starts instances whose
  NOTIFY was missed while the worker was down, instead of waiting out the grace.
@tjgreen42

Copy link
Copy Markdown
Contributor Author

Superseding this approach. After review discussion we're pivoting away from deferring the start (persist-intent + worker-driven start) to a simpler model: keep the eager start in df.start() but propagate the enqueue error (fail-fast), and add a periodic background reaper that drives duroxide's retention/deletion API. The rolled-back-start 'ghost' is provably inert (the orchestration fails at graph-load before any node runs), so it only needs cleanup, not prevention — and that cleanup doubles as the duroxide-side retention pg_durable currently lacks. Replacement PRs to follow.

@tjgreen42 tjgreen42 closed this Jul 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant