Job child process can hang indefinitely in unbounded await dispose() and never reach process.exit(0) #1905

@rinarakaki

Summary

In the process job executor, the job child's final teardown awaits dispose() (native FFI cleanup) without a timeout before process.exit(0) (agents/src/ipc/job_proc_lazy_main.ts):

await join.await;
try {
  await dispose();           // ← unbounded
  logger.debug('native resources disposed');
} catch (error) {
  logger.warn({ error }, 'failed to dispose native resources');
}
logger.debug('Job process shutdown');
process.exit(0);

If native disposal never returns (a native handle that never drains), the child never reaches process.exit(0) and lingers indefinitely, holding the job's full RSS.

Why nothing reclaims it

  • The child's JS event loop is still alive during the wedged native call, so it keeps answering supervisor pings — the orphan/health reaper considers it healthy and never kills it.
  • Handlers for SIGINT and SIGTERM are installed via process.on() as log-only no-ops. Installing a listener replaces Node's default exit-on-signal behavior, so the worker's graceful termination request is swallowed and the child does not exit.

So a child wedged here is effectively immortal short of SIGKILL.

Production impact

On a long-running worker fleet we see a steady multi-day RSS climb: a few job children per day finish all their shutdown callbacks, then freeze in this final dispose() and never exit, each holding ~1.4 GB (~40% of it native/FFI memory). They accumulate until a task hits its memory cap and the kernel OOM-kills in-flight calls. A redeploy resets it; it recurs every ~2–3 days.

We traced one frozen child end-to-end: every application shutdown callback logged to completion, then the process emitted only periodic logs for the next 43 hours with no exit. By elimination (the shutdown callbacks all complete or are independently bounded), the only unbounded await left before process.exit(0) is this await dispose().

Note

dispose() is correct and necessary here — it was added to prevent the libc++abi mutex crash on exit-with-active-native-threads (livekit/node-sdks#564). The gap is only that the wait is unbounded. This is still the case on main and in the latest 1.4.9.

Related symptoms reported against the Python SDK: livekit/agents#3174 (closed not-planned), livekit/agents#3637, livekit/agents#617.

Proposed fix

Bound dispose() with a timeout and fall through to process.exit(0) on expiry — dispose() is still attempted first (preserving the libc++abi guard on the normal path); the timeout only changes the pathological case from "hang forever" to "exit anyway". A crash on exit is strictly preferable to a zombie holding the job's full RSS. A fix PR follows.
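A minimal sketch of what the bounded teardown could look like. This is not the fix PR: the names withTimeout, teardown, and BoundedResult are hypothetical, and the timeout value is left to the caller. dispose and exit are passed in as parameters so the sketch is self-contained.

```typescript
// Hypothetical result type: 'done' if the work settled in time, 'timeout' if it did not.
type BoundedResult = 'done' | 'timeout';

// Hypothetical helper: race `work` against a timer and report which one won.
// If `work` rejects before the timer fires, the rejection propagates to the caller.
async function withTimeout(work: Promise<unknown>, ms: number): Promise<BoundedResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const expiry = new Promise<BoundedResult>((resolve) => {
    timer = setTimeout(() => resolve('timeout'), ms);
  });
  try {
    return await Promise.race([work.then((): BoundedResult => 'done'), expiry]);
  } finally {
    // Clear the timer so it does not keep the event loop alive on the normal path.
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Hypothetical teardown: attempt dispose() first, then exit regardless of the outcome.
async function teardown(
  dispose: () => Promise<void>,
  exit: (code: number) => void,
  timeoutMs: number,
): Promise<BoundedResult | 'failed'> {
  let outcome: BoundedResult | 'failed';
  try {
    // 'done' is the normal path; 'timeout' is the pathological case from this issue.
    outcome = await withTimeout(dispose(), timeoutMs);
  } catch {
    // Mirrors the existing catch branch: a failed dispose is logged, not fatal.
    outcome = 'failed';
  }
  exit(0); // always reached, even if dispose() never settles
  return outcome;
}
```

In job_proc_lazy_main.ts this would correspond to replacing the bare await dispose() with something like await withTimeout(dispose(), DISPOSE_TIMEOUT_MS), logging a warning on 'timeout', and falling through to process.exit(0). The timer can only fire while the JS event loop is running, which matches the observation above that the wedged child keeps answering pings; a native call that blocks the thread synchronously would need a different guard.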

Versions

@livekit/agents 1.4.x (reproduced on 1.4.4; main and 1.4.9 both still unbounded). JobExecutorType.PROCESS.
