fix: exit cleanly on fatal startup errors instead of crash-looping (#4253) #4601
Open
apotema wants to merge 1 commit into
…okploy#4253)

After "Migration complete", a fatal error in the server startup window (e.g. a rejected `app.prepare()`) did not terminate the process. Background handles (the ioredis reconnect loop, open sockets) kept the event loop alive, so instead of exiting the process spun at high CPU, never passed the healthcheck, and Docker Swarm crash-looped the container showing only "ELIFECYCLE Command failed."

Reproduced with the real compiled bundle (node:24.4.0-slim + Postgres 16):
- before: fatal startup error -> process never exits, spins until killed
- after: fatal startup error -> logs the cause, exits 1 in ~3-5s

Changes in apps/dokploy/server/server.ts:
- Phase-gated process handlers: before the HTTP server is listening, an uncaught exception or unhandled rejection logs the cause and exit(1)s so the orchestrator restarts cleanly. After it is listening, a stray rejection is only logged, so a healthy serving instance is never killed (verified: real post-listen docker.sock ENOENT rejection is survived).
- try/catch around the synchronous directory/Traefik init.
- await the listen() bind so a bind failure (e.g. EADDRINUSE) exits instead of spinning; only mark the server ready once actually listening.
- .catch() on app.prepare() with a labeled diagnostic.
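The phase-gated handlers described in the commit message can be illustrated with a minimal sketch. This is not the code from `server.ts`: `createPhaseGate`, `markListening`, and the injected `log`/`exit` functions are hypothetical names chosen so the logic can be exercised without terminating a real process. The commit message states the post-listen behavior only for rejections; treating uncaught exceptions the same way after listening is an assumption of this sketch.

```typescript
// Hypothetical sketch of phase-gated fatal-error handling (names are illustrative).
type Phase = "starting" | "listening";

interface GateDeps {
  log: (message: string, cause: unknown) => void;
  exit: (code: number) => void;
}

function createPhaseGate(deps: GateDeps) {
  let phase: Phase = "starting";
  return {
    // Call this only once the HTTP server is actually listening.
    markListening(): void {
      phase = "listening";
    },
    // Shared handler for uncaughtException / unhandledRejection.
    handle(kind: string, cause: unknown): void {
      if (phase === "starting") {
        // Startup window: log the cause and exit so the orchestrator restarts cleanly.
        deps.log(`[startup] fatal ${kind}`, cause);
        deps.exit(1);
        return;
      }
      // Serving: log only, so a healthy instance is not killed by a stray error.
      deps.log(`[runtime] ${kind} (logged only)`, cause);
    },
  };
}

// Possible wiring in a real process (commented out so the sketch has no side effects):
// const gate = createPhaseGate({ log: console.error, exit: process.exit });
// process.on("uncaughtException", (err) => gate.handle("uncaughtException", err));
// process.on("unhandledRejection", (reason) => gate.handle("unhandledRejection", reason));
```

Injecting `exit` and `log` keeps the decision logic testable; the gate flips exactly once, when the server reports that it is listening.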
e2a1f9f to
9e122de
Compare
Problem
Fixes #4253. After `Migration complete`, a fatal error in the server startup window did not terminate the process. Background handles (the ioredis reconnect loop, open sockets) keep the event loop alive, so instead of exiting, the process spins at high CPU, never passes the healthcheck (`/api/trpc/settings.health`), and Docker Swarm crash-loops the container showing only "ELIFECYCLE Command failed."

This matches every detail the reporter described: the crash loop pegging CPU (their 700–900%), the server never reaching "Server Started", and Swarm marking the task `non-zero exit (1): unhealthy container`.

Root cause
In `apps/dokploy/server/server.ts`:

- The `setupDirectories()` / `createDefaultTraefikConfig()` block had no `try/catch`.
- `app.prepare()` had error handling inside its `.then()` but no `.catch()`, so a rejected `prepare()` became an unhandled rejection.
- There were no `uncaughtException` / `unhandledRejection` handlers, and a dependency registers a logging-only `unhandledRejection` listener that suppresses Node's default "crash on unhandled rejection".

So a fatal startup error is logged (or not) but the process never exits; it just spins.

Fix
Phase-gated error handling, so a failed startup exits cleanly without destabilizing a healthy server:
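- Before the HTTP server is listening, an uncaught exception / unhandled rejection / `prepare()` rejection / bind failure logs the cause and `exit(1)`s, so the orchestrator restarts cleanly instead of spinning.
- After it is listening, a stray rejection is only logged, so a healthy serving instance is never killed.
- `await` the `listen()` bind so a bind failure (e.g. `EADDRINUSE`) exits instead of spinning; mark the server ready only once actually listening.
- `try/catch` around the synchronous directory/Traefik init; `.catch()` on `app.prepare()` with a labeled diagnostic.

Verification (real compiled bundle)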
Built the real `dist/migration.mjs` + `dist/server.mjs` from this branch (node:24.4.0-slim, real Postgres 16 + Redis on a Docker network) and exercised the actual `migration → server` boot:

- canary `prepare()` failure: `Failed to prepare…`, clean exit 1 in ~5s
- `Server Started on`, healthcheck HTTP 200, running
- post-listen rejection (`docker.sock` `ENOENT`)

The post-listen case is why the handlers are phase-gated: a naive "always `exit(1)` on unhandled rejection" would kill a healthy server on that exact background `docker.sock` rejection.

Scope
This fixes the crash-loop mechanism: any fatal startup failure now produces a clean, logged `exit(1)` instead of a silent high-CPU spin. The separate report in the comments (the `dokploy-postgres` role disappearing after ~1–2h) is a distinct database-lifecycle issue this PR does not address; with this change, that failure would surface as a clean diagnostic exit rather than a silent spin.

Testing
`tsc --noEmit` clean; `biome check` clean.
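For reviewers who want to see the "await the `listen()` bind" change from the Fix section in isolation, here is a minimal sketch using Node's built-in `node:http`. It is an illustration under assumptions, not the PR's code: `listenAsync` and `startServer` are hypothetical helpers, the default host `127.0.0.1` is chosen only to keep the example local, and `exit` is injected so the failure path can be observed without killing the process.

```typescript
import { createServer } from "node:http";
import type { Server } from "node:http";

// Hypothetical helper: resolve once the server is listening, reject on a bind error.
function listenAsync(server: Server, port: number, host = "127.0.0.1"): Promise<void> {
  return new Promise((resolve, reject) => {
    const onError = (err: Error) => {
      server.off("listening", onListening);
      reject(err); // e.g. EADDRINUSE
    };
    const onListening = () => {
      server.off("error", onError);
      resolve();
    };
    server.once("error", onError);
    server.once("listening", onListening);
    server.listen(port, host);
  });
}

// Hypothetical startup step: a bind failure is logged and turned into exit(1).
async function startServer(
  port: number,
  exit: (code: number) => void,
): Promise<Server | undefined> {
  const server = createServer((_req, res) => {
    res.end("ok");
  });
  try {
    await listenAsync(server, port);
    return server; // only at this point should the server be marked ready
  } catch (err) {
    console.error("[startup] failed to bind", err);
    exit(1);
    return undefined;
  }
}
```

Without the awaited promise, a bind error surfaces only as an `error` event on the server object, so the startup code cannot tell a successful bind from a failed one before marking the server ready.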