Job child process can hang indefinitely in unbounded `await dispose()` and never reach `process.exit(0)` #1905

### Summary

In the process job executor, the job child's final teardown awaits `dispose()` (native FFI cleanup) **without a timeout** before `process.exit(0)` (`agents/src/ipc/job_proc_lazy_main.ts`):

```ts
await join.await;
try {
  await dispose();           // ← unbounded
  logger.debug('native resources disposed');
} catch (error) {
  logger.warn({ error }, 'failed to dispose native resources');
}
logger.debug('Job process shutdown');
process.exit(0);
```

If native disposal never returns (for example, because a native handle never drains), the child **never reaches `process.exit(0)` and lingers indefinitely**, holding the job's full RSS.

### Why nothing reclaims it

- The child's JS event loop is still alive during the wedged native call, so it keeps answering supervisor pings. The orphan/health reaper therefore considers it healthy and never kills it.
- The `SIGINT` and `SIGTERM` listeners registered with `process.on(...)` are no-op log handlers, so the worker's graceful termination signal is swallowed and the child does not exit.

So a child wedged here is effectively immortal short of `SIGKILL`.
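To illustrate the second point: in Node.js, registering any listener for `SIGINT` or `SIGTERM` removes the default "exit on signal" behavior, so a listener that only logs leaves the process running. The snippet below is a minimal sketch of that pattern, not the actual code in `job_proc_lazy_main.ts`; the function name `installNoopSignalLoggers` and the log message are hypothetical.

```typescript
// Hypothetical reproduction of the "no-op log handler" pattern described above.
// Once a listener exists for a signal, Node.js no longer exits on that signal
// by default, so a handler that only logs effectively swallows the signal.
function installNoopSignalLoggers(log: (msg: string) => void): void {
  for (const signal of ['SIGINT', 'SIGTERM'] as const) {
    process.on(signal, () => {
      // Logs only; never calls process.exit(), so the process keeps running.
      log(`received ${signal}, ignoring`);
    });
  }
}
```

With handlers like these installed, a `SIGTERM` from the worker produces a log line and nothing else, which is why only `SIGKILL` (which cannot be handled) ends a wedged child.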

### Production impact

On a long-running worker fleet we see a steady multi-day RSS climb: a few job children per day finish **all** their shutdown callbacks, then freeze in this final `dispose()` and never exit, each holding ~1.4 GB (~40% of it native/FFI memory). They accumulate until a task hits its memory cap and the kernel OOM-kills in-flight calls. A redeploy resets it; it recurs every ~2–3 days.

We traced one frozen child end-to-end: every application shutdown callback logged to completion, then the process emitted **only periodic logs for the next 43 hours with no exit**. By elimination (the shutdown callbacks all complete or are independently bounded), the only unbounded await left before `process.exit(0)` is this `await dispose()`.

### Note

`dispose()` is correct and necessary here — it was added to prevent the libc++abi mutex crash on exit-with-active-native-threads (livekit/node-sdks#564). The gap is only that the wait is **unbounded**. This is still the case on `main` and in the latest `1.4.9`.

Related symptoms reported against the Python SDK: livekit/agents#3174 (closed not-planned), livekit/agents#3637, livekit/agents#617.

### Proposed fix

Bound `dispose()` with a timeout and fall through to `process.exit(0)` on expiry — `dispose()` is still attempted first (preserving the libc++abi guard on the normal path); the timeout only changes the pathological case from "hang forever" to "exit anyway". A crash on exit is strictly preferable to a zombie holding the job's full RSS. A fix PR follows.
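A minimal sketch of the bounded wait, assuming the hang leaves the JS event loop running (as the ping behavior above suggests) so that a timer can still fire. This is not the actual fix PR: `settleWithin`, `disposeThenExit`, the `Logger` shape, and the 5-second default are hypothetical, and `dispose` and `exit` are passed in as parameters only to keep the example self-contained.

```typescript
// Minimal logger shape assumed for this sketch.
type Logger = {
  debug: (msg: string) => void;
  warn: (ctx: object, msg: string) => void;
};

// Hypothetical helper: resolves true if `task` fulfills within `timeoutMs`,
// false if the timer fires first. A rejection of `task` propagates to the caller.
async function settleWithin(task: Promise<unknown>, timeoutMs: number): Promise<boolean> {
  let timer!: ReturnType<typeof setTimeout>; // assigned synchronously in the executor below
  const timedOut = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });
  try {
    return await Promise.race([task.then(() => true), timedOut]);
  } finally {
    clearTimeout(timer); // avoid keeping the loop alive once the race is decided
  }
}

// Sketch of the final teardown: dispose() is still attempted first, but the
// exit call is reached whether dispose() resolves, rejects, or times out.
async function disposeThenExit(
  dispose: () => Promise<void>,
  exit: (code: number) => void,
  logger: Logger,
  timeoutMs = 5_000, // assumed value, not taken from the library
): Promise<void> {
  try {
    if (await settleWithin(dispose(), timeoutMs)) {
      logger.debug('native resources disposed');
    } else {
      logger.warn({ timeoutMs }, 'dispose() timed out; exiting anyway');
    }
  } catch (error) {
    logger.warn({ error }, 'failed to dispose native resources');
  }
  logger.debug('Job process shutdown');
  exit(0);
}
```

In the child this would be called as something like `await disposeThenExit(dispose, process.exit, logger)`. One limitation of this approach: if `dispose()` blocked the JS thread synchronously, the timer could not fire, so the bound only helps when the promise stays pending while the event loop keeps turning.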

### Versions

`@livekit/agents` 1.4.x (reproduced on 1.4.4; `main` and `1.4.9` both still unbounded). `JobExecutorType.PROCESS`.

