Fail fast in df.start() when the engine hand-off fails#282

Merged
pinodeca merged 1 commit into main from tjgreen42/df-start-fail-fast on Jul 1, 2026

@tjgreen42

Contributor

Problem

df.start() writes its df.instances / df.nodes rows in the caller's transaction, then hands the workflow to the durable engine over a separate connection. If that hand-off fails, the error is currently only logged and df.start() returns normally, so the caller commits an instance row for a workflow that was never started and will never run.

Change

On a failed hand-off, df.start() now raises, aborting the caller's transaction so the instance rows roll back with it. The caller gets a clear, retryable error instead of a silently stuck instance.
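To make the rollback behavior concrete, here is a minimal, self-contained sketch in plain Rust. It is not the extension's code: `Db`, `Txn`, `HandoffError`, `hand_off`, and `start` are hypothetical stand-ins, and the "transaction" is a toy that stages rows in memory and publishes them only on success. It assumes only what the description above states: rows are written in the caller's transaction, and a failed hand-off now surfaces as an error that aborts that transaction.

```rust
use std::fmt;

/// Hypothetical error raised when the engine hand-off fails.
#[derive(Debug, PartialEq)]
pub struct HandoffError(pub String);

impl fmt::Display for HandoffError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl std::error::Error for HandoffError {}

/// Toy stand-in for the caller's transaction: rows are staged here
/// and become visible only if the transaction commits.
#[derive(Default)]
pub struct Txn {
    staged: Vec<String>,
}

/// Toy stand-in for the committed instance table.
#[derive(Default)]
pub struct Db {
    pub instances: Vec<String>,
}

impl Db {
    /// Runs `body` inside a toy transaction: commit on Ok, discard on Err.
    pub fn transaction<F>(&mut self, body: F) -> Result<(), HandoffError>
    where
        F: FnOnce(&mut Txn) -> Result<(), HandoffError>,
    {
        let mut txn = Txn::default();
        body(&mut txn)?; // on Err the staged rows are dropped, never committed
        self.instances.append(&mut txn.staged);
        Ok(())
    }
}

/// Stand-in for the hand-off over a separate connection.
/// `worker_ready` models whether the background worker can accept work.
pub fn hand_off(worker_ready: bool, instance_id: &str) -> Result<(), HandoffError> {
    if worker_ready {
        Ok(())
    } else {
        Err(HandoffError(format!(
            "engine hand-off failed for instance {instance_id}; retry"
        )))
    }
}

/// Fail-fast start: stage the instance row, then propagate any hand-off
/// error to the caller instead of logging it and returning normally.
pub fn start(txn: &mut Txn, worker_ready: bool, instance_id: &str) -> Result<(), HandoffError> {
    txn.staged.push(instance_id.to_string());
    hand_off(worker_ready, instance_id)?;
    Ok(())
}
```

In this toy model, calling `db.transaction(|txn| start(txn, false, "wf-1"))` returns an error and leaves `db.instances` empty, while the same call with `true` commits the row. The previous behavior corresponds to replacing the `?` in `start` with a log statement, which would let the staged row commit even though nothing was started.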

The pgrx unit-test build does not run the background worker, so the hand-off always fails there; that build logs instead of aborting (matching the existing validate_database test-build carve-out) so df.start()'s graph construction stays unit-testable.
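The test-build carve-out can be sketched the same way. This is an illustration under stated assumptions, not the PR's implementation: `finish_handoff`, `HandoffOutcome`, and the `is_test_build` flag are hypothetical names, and the real extension presumably selects the behavior at compile time rather than with a runtime boolean. A runtime flag is used here only so both branches can be exercised in one program.

```rust
/// Hypothetical outcome of completing a hand-off.
#[derive(Debug, PartialEq)]
pub enum HandoffOutcome {
    /// The engine accepted the workflow.
    Started,
    /// Test build only: the failure was logged and start continues.
    LoggedOnly(String),
}

/// Decides what a hand-off result means for the caller.
///
/// `is_test_build` is an assumed stand-in for a compile-time test
/// configuration; `log` is a toy sink standing in for the server log.
pub fn finish_handoff(
    result: Result<(), String>,
    is_test_build: bool,
    log: &mut Vec<String>,
) -> Result<HandoffOutcome, String> {
    match result {
        Ok(()) => Ok(HandoffOutcome::Started),
        // No background worker in the unit-test build: log and carry on
        // so graph construction can still be inspected by tests.
        Err(msg) if is_test_build => {
            log.push(msg.clone());
            Ok(HandoffOutcome::LoggedOnly(msg))
        }
        // Normal build: surface the error so the caller's transaction aborts.
        Err(msg) => Err(msg),
    }
}
```

Keeping the decision in one small function means the normal build has a single place where a failed hand-off becomes an error, and the test build has a single place where it is downgraded to a log line.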

Tests

tests/e2e/sql/25_start_fail_fast.sql forces the hand-off to fail (marks the worker not-ready) and asserts df.start() raises and commits no row, then confirms a normal start still completes once readiness is restored.

Local: unit 194 · E2E 37 · upgrade 36 · fmt/clippy clean.

df.start() writes its df rows in the caller's transaction and hands the
workflow to the durable engine over a separate connection. A failed
hand-off was swallowed (logged), so the caller committed an instance row
for a workflow that never started and would never run.

Raise on a failed hand-off instead, aborting the caller's transaction so
the instance rows roll back with it. The caller gets a clear, retryable
error rather than a silently stuck instance.

The pgrx unit-test build does not run the background worker, so the
hand-off always fails there; that build logs instead of aborting (matching
the existing validate_database test-build carve-out) so df.start()'s graph
construction stays unit-testable.

Test: tests/e2e/sql/25_start_fail_fast.sql.

@pinodeca pinodeca left a comment

Contributor

LGTM

@pinodeca pinodeca merged commit 608fe97 into main Jul 1, 2026
8 of 9 checks passed
@pinodeca pinodeca deleted the tjgreen42/df-start-fail-fast branch July 1, 2026 19:27