
fix: exit cleanly on fatal startup errors instead of crash-looping (#4253) #4601

Open
apotema wants to merge 1 commit into Dokploy:canary from apotema:fix/4253-silent-server-crash-diagnostics

@apotema apotema commented Jun 9, 2026

Problem

Fixes #4253. After "Migration complete", a fatal error in the server startup window did not terminate the process. Background handles (the ioredis reconnect loop, open sockets) kept the event loop alive, so instead of exiting, the process spun at high CPU, never passed the healthcheck (/api/trpc/settings.health), and Docker Swarm crash-looped the container, showing only:

Using Docker socket (Standard Docker socket): /var/run/docker.sock
ELIFECYCLE  Command failed.

This matches every detail the reporter described: the crash loop pegging CPU (their 700–900%), the server never reaching "Server Started", and Swarm marking the task as "non-zero exit (1): unhealthy container".

Root cause

In apps/dokploy/server/server.ts:

  • The top-level setupDirectories() / createDefaultTraefikConfig() block had no try/catch.
  • app.prepare() had error handling inside its .then() but no .catch() — a rejected prepare() became an unhandled rejection.
  • There were no process-level uncaughtException / unhandledRejection handlers, and a dependency registers a logging-only unhandledRejection listener that suppresses Node's default "crash on unhandled rejection". So a fatal startup error is logged (or not) but the process never exits — it just spins.
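The missing .catch() in the second bullet can be illustrated with a minimal sketch. The functions `prepare`, `bootWithoutCatch`, and `bootWithCatch` are hypothetical stand-ins, not the code in server.ts. They only show that a try/catch placed inside .then() never sees a rejection of the promise itself, while a trailing .catch() does.

```typescript
// Hypothetical stand-in for app.prepare(); it rejects to simulate a fatal startup error.
function prepare(fail: boolean): Promise<void> {
  return fail
    ? Promise.reject(new Error("simulated prepare() failure"))
    : Promise.resolve();
}

// The try/catch lives inside .then(), so it only guards work that runs
// after prepare() has resolved. A rejection of prepare() skips this callback.
function bootWithoutCatch(fail: boolean, log: string[]): Promise<void> {
  return prepare(fail).then(() => {
    try {
      log.push("server started");
    } catch (err) {
      log.push(`handled inside then: ${String(err)}`);
    }
  });
}

// A trailing .catch() gives the rejection a handler, so it can be logged
// (and, in a real server, followed by an exit) instead of going unhandled.
function bootWithCatch(fail: boolean, log: string[]): Promise<void> {
  return bootWithoutCatch(fail, log).catch((err: unknown) => {
    log.push(`startup failed: ${String(err)}`);
  });
}

const log: string[] = [];
bootWithCatch(true, log).then(() => console.log(log));
```

Without the trailing .catch(), the returned promise stays rejected, and whether the process then exits depends on which unhandledRejection listeners are registered, as the third bullet describes.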

Fix

Phase-gated error handling, so a failed startup exits cleanly without destabilizing a healthy server:

  • Before the HTTP server is listening → an uncaught exception / unhandled rejection / sync-init throw / prepare() rejection / bind failure logs the cause and exit(1)s, so the orchestrator restarts cleanly instead of spinning.
  • After it is listening → a stray unhandled rejection is only logged, so an otherwise-healthy serving instance is never killed.
  • await the listen() bind so a bind failure (e.g. EADDRINUSE) exits instead of spinning; mark the server ready only once actually listening.
  • try/catch around the synchronous directory/Traefik init; .catch() on app.prepare() with a labeled diagnostic.
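Awaiting the bind could look roughly like the sketch below. It assumes a plain node:http server; `listenAsync` and `demo` are hypothetical names, and the PR's actual implementation may differ. The point is that a bind failure surfaces as a rejected promise that the caller can log and turn into an exit, rather than as an "error" event nobody is waiting on.

```typescript
import { createServer, Server } from "node:http";

// Resolves once the server is actually listening, rejects if the bind fails
// (for example with EADDRINUSE). After a successful bind the one-shot
// "error" listener is removed; later errors need their own handling.
function listenAsync(server: Server, port: number, host: string = "127.0.0.1"): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    function onError(err: Error): void {
      server.off("listening", onListening);
      reject(err);
    }
    function onListening(): void {
      server.off("error", onError);
      resolve();
    }
    server.once("error", onError);
    server.once("listening", onListening);
    server.listen(port, host);
  });
}

// Demo: bind one server on an ephemeral port, then try to bind a second
// server to the same port and report the error code of the failed bind.
async function demo(): Promise<string | undefined> {
  const first = createServer();
  const second = createServer();
  await listenAsync(first, 0);
  const address = first.address();
  const port = typeof address === "object" && address !== null ? address.port : 0;
  try {
    await listenAsync(second, port);
    return undefined; // no conflict observed
  } catch (err) {
    return (err as { code?: string }).code;
  } finally {
    first.close();
    if (second.listening) second.close();
  }
}

demo().then((code) => console.log("second bind rejected with:", code));
```

In a startup path, the catch branch is where the cause would be logged and the process exited with code 1, and the "ready" flag would be set only after the awaited call returns.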

Verification (real compiled bundle)

Built the real dist/migration.mjs + dist/server.mjs from this branch (node:24.4.0-slim, real Postgres 16 + Redis on a Docker network) and exercised the actual migration → server boot:

| Scenario | canary | This PR |
|---|---|---|
| Pre-listen prepare() failure | spins forever, high CPU, killed at timeout (exit 124) | logs Failed to prepare…, clean exit 1 in ~5s |
| Normal boot | | reaches Server Started on, healthcheck HTTP 200, running |
| Real post-listen rejection (docker.sock ENOENT) | | logged, server survives and stays healthy |

The post-listen case is why the handlers are phase-gated: a naive "always exit(1) on unhandled rejection" would kill a healthy server on that exact background docker.sock rejection.
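The phase gate described here can be sketched as a small state machine. This is an illustration under assumptions, not the code in server.ts: `createStartupGate`, `GateDeps`, and the injected `log` and `exit` functions are hypothetical names, and uncaughtException handling is left out because its post-listen behavior is not spelled out above.

```typescript
type Phase = "starting" | "listening";

// Injected so the policy can be exercised without touching the real process.
interface GateDeps {
  log: (message: string, cause: unknown) => void;
  exit: (code: number) => void;
}

interface StartupGate {
  markListening(): void;
  onUnhandledRejection(reason: unknown): void;
  failStartup(label: string, cause: unknown): void;
}

function createStartupGate(deps: GateDeps): StartupGate {
  let phase: Phase = "starting";
  return {
    // Call only once the HTTP server is actually listening.
    markListening(): void {
      phase = "listening";
    },
    // Before listening: log and exit(1). After listening: log only,
    // so a stray background rejection cannot kill a healthy server.
    onUnhandledRejection(reason: unknown): void {
      deps.log(`unhandled rejection while ${phase}`, reason);
      if (phase === "starting") deps.exit(1);
    },
    // For explicit startup failures (sync init throw, prepare() rejection, bind failure).
    failStartup(label: string, cause: unknown): void {
      deps.log(label, cause);
      deps.exit(1);
    },
  };
}

// Demo with recording fakes instead of console.error / process.exit.
const events: string[] = [];
const gate = createStartupGate({
  log: (message, cause) => { events.push(`log: ${message}: ${String(cause)}`); },
  exit: (code) => { events.push(`exit(${code})`); },
});

gate.onUnhandledRejection(new Error("simulated startup failure")); // logs, then exits
gate.markListening();
gate.onUnhandledRejection(new Error("simulated docker.sock ENOENT")); // logs only
console.log(events);
```

In a real server the gate would presumably be wired with `process.on("unhandledRejection", (reason) => gate.onUnhandledRejection(reason))` and `exit: (code) => process.exit(code)`. A naive policy corresponds to never calling `markListening()`, which is exactly the case that would kill the healthy instance.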

Scope

This fixes the crash-loop mechanism — any fatal startup failure now produces a clean, logged exit(1) instead of a silent high-CPU spin. The separate report in the comments (the dokploy-postgres role disappearing after ~1–2h) is a distinct database-lifecycle issue this PR does not address; with this change, that failure would surface as a clean diagnostic exit rather than a silent spin.

Testing

  • tsc --noEmit clean; biome check clean.
  • Reproduced before/after with the real bundle as above.

@apotema apotema requested a review from Siumauricio as a code owner June 9, 2026 13:10
@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jun 9, 2026
…okploy#4253)

After "Migration complete", a fatal error in the server startup window (e.g. a
rejected `app.prepare()`) did not terminate the process. Background handles —
the ioredis reconnect loop, open sockets — kept the event loop alive, so instead
of exiting the process spun at high CPU, never passed the healthcheck, and Docker
Swarm crash-looped the container showing only "ELIFECYCLE  Command failed."

Reproduced with the real compiled bundle (node:24.4.0-slim + Postgres 16):
- before: fatal startup error -> process never exits, spins until killed
- after:  fatal startup error -> logs the cause, exits 1 in ~3-5s

Changes in apps/dokploy/server/server.ts:
- Phase-gated process handlers: before the HTTP server is listening, an uncaught
  exception or unhandled rejection logs the cause and exit(1)s so the orchestrator
  restarts cleanly. After it is listening, a stray rejection is only logged, so a
  healthy serving instance is never killed (verified: real post-listen docker.sock
  ENOENT rejection is survived).
- try/catch around the synchronous directory/Traefik init.
- await the listen() bind so a bind failure (e.g. EADDRINUSE) exits instead of
  spinning; only mark the server ready once actually listening.
- .catch() on app.prepare() with a labeled diagnostic.
@apotema apotema force-pushed the fix/4253-silent-server-crash-diagnostics branch from e2a1f9f to 9e122de Compare June 9, 2026 13:38
@apotema apotema changed the title from "fix: log fatal startup errors instead of crashing silently (#4253)" to "fix: exit cleanly on fatal startup errors instead of crash-looping (#4253)" Jun 9, 2026

Labels

size:L This PR changes 100-499 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

v0.29.0: Silent ELIFECYCLE crash after "Migration complete" - no error output, crash loops indefinitely
