Summary
In the process job executor, the job child's final teardown awaits dispose() (native FFI cleanup) without a timeout before process.exit(0) (agents/src/ipc/job_proc_lazy_main.ts):
    await join.await;
    try {
      await dispose(); // ← unbounded
      logger.debug('native resources disposed');
    } catch (error) {
      logger.warn({ error }, 'failed to dispose native resources');
    }
    logger.debug('Job process shutdown');
    process.exit(0);
If native disposal never returns (a native handle that never drains), the child never reaches process.exit(0) and lingers indefinitely, holding the job's full RSS.
Why nothing reclaims it
- The child's JS event loop is still alive during the wedged native call, so it keeps answering supervisor pings — the orphan/health reaper considers it healthy and never kills it.
- The process.on('SIGINT') and process.on('SIGTERM') handlers are installed as no-op log handlers, so the worker's graceful termination signal is swallowed and the child does not exit.
So a child wedged here is effectively immortal short of SIGKILL.
Production impact
On a long-running worker fleet we see a steady multi-day RSS climb: a few job children per day finish all their shutdown callbacks, then freeze in this final dispose() and never exit, each holding ~1.4 GB (~40% of it native/FFI memory). They accumulate until a task hits its memory cap and the kernel OOM-kills in-flight calls. A redeploy resets it; it recurs every ~2–3 days.
We traced one frozen child end-to-end: every application shutdown callback logged to completion, then the process emitted only periodic logs for the next 43 hours with no exit. By elimination (the shutdown callbacks all complete or are independently bounded), the only unbounded await left before process.exit(0) is this await dispose().
Note
dispose() is correct and necessary here — it was added to prevent the libc++abi mutex crash on exit-with-active-native-threads (livekit/node-sdks#564). The gap is only that the wait is unbounded. This is still the case on main and in the latest 1.4.9.
Related symptoms reported against the Python SDK: livekit/agents#3174 (closed not-planned), livekit/agents#3637, livekit/agents#617.
Proposed fix
Bound dispose() with a timeout and fall through to process.exit(0) on expiry — dispose() is still attempted first (preserving the libc++abi guard on the normal path); the timeout only changes the pathological case from "hang forever" to "exit anyway". A crash on exit is strictly preferable to a zombie holding the job's full RSS. A fix PR follows.
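A minimal sketch of what such a bound could look like, not the actual patch. The helper name disposeWithDeadline, the DisposeOutcome type, and the 5-second value are all hypothetical and chosen only for illustration. It relies on the observation above that the child's event loop keeps running while native disposal is stuck, so a timer can still fire.

```typescript
// Hypothetical helper (not part of @livekit/agents): waits for dispose(),
// but gives up after timeoutMs so the caller can still reach process.exit(0).
type DisposeOutcome = 'disposed' | 'timeout' | 'failed';

function disposeWithDeadline(
  dispose: () => Promise<void>,
  timeoutMs: number,
): Promise<DisposeOutcome> {
  return new Promise<DisposeOutcome>((resolve) => {
    // If the deadline fires first, the outer promise resolves with 'timeout'.
    const timer = setTimeout(() => resolve('timeout'), timeoutMs);

    const settle = (outcome: DisposeOutcome) => {
      clearTimeout(timer); // do not keep the event loop alive after settling
      resolve(outcome); // no-op if the deadline already resolved the promise
    };

    // Starting from Promise.resolve() turns a synchronous throw inside
    // dispose() into a rejection, so it is reported as 'failed' too.
    Promise.resolve()
      .then(dispose)
      .then(
        () => settle('disposed'),
        () => settle('failed'),
      );
  });
}

// Assumed usage in the final teardown (timeout value is arbitrary):
//
//   const outcome = await disposeWithDeadline(dispose, 5_000);
//   if (outcome !== 'disposed') {
//     logger.warn({ outcome }, 'native disposal did not complete cleanly');
//   }
//   process.exit(0);
```

The helper never rejects, so the exit call after it is always reached. On the normal path dispose() still completes before exit; only the stuck case changes.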
Versions
@livekit/agents 1.4.x (reproduced on 1.4.4; main and 1.4.9 both still unbounded). JobExecutorType.PROCESS.