rtc: fail fast instead of hanging when FFI is used across fork()#740
rtc: fail fast instead of hanging when FFI is used across fork()#740theomonnom wants to merge 1 commit into
Conversation
| assert FfiClient.instance._ffi_lib.livekit_ffi_drop_handle(ctypes.c_uint64(self.handle)) | ||
| ffi = FfiClient._owned_instance() | ||
| if ffi is not None: | ||
| assert ffi._ffi_lib.livekit_ffi_drop_handle(ctypes.c_uint64(self.handle)) |
There was a problem hiding this comment.
will assert be stripped out if our customers use some optimization flags ?
if that gets stripped out, I will assume ffi._ffi_lib.livekit_ffi_drop_handle will never be called.
xianshijing-lk
left a comment
There was a problem hiding this comment.
lgtm assuming you will address the comment about 'assert'
e192426 to
d06cfdd
Compare
| def dispose(self) -> None: | ||
| if self.handle != INVALID_HANDLE and not self._disposed: | ||
| self._disposed = True | ||
| assert FfiClient.instance._ffi_lib.livekit_ffi_drop_handle(ctypes.c_uint64(self.handle)) | ||
| dropped = FfiClient.instance._ffi_lib.livekit_ffi_drop_handle(ctypes.c_uint64(self.handle)) | ||
| assert dropped |
There was a problem hiding this comment.
🔴 Fork-safety guard missing on native handle cleanup, allowing child processes to hang or crash on exit
Inherited native handles are released into the dead runtime (livekit_ffi_drop_handle at livekit-rtc/livekit/rtc/_ffi_client.py:86) without the same process-ID check added to requests, so forked child processes can hang or abort during garbage collection.
Impact: A forked child process may hang or crash at exit when Python's garbage collector cleans up inherited handle objects.
Mechanism: FfiHandle.dispose() bypasses the fork guard added to request()
The PR adds a PID check in request() (livekit-rtc/livekit/rtc/_ffi_client.py:274) and in the atexit handler (livekit-rtc/livekit/rtc/_ffi_client.py:265) to prevent calling into the native FFI library from a forked child process. However, FfiHandle.dispose() at livekit-rtc/livekit/rtc/_ffi_client.py:83-87 also calls into the native library via livekit_ffi_drop_handle, and has no such guard.
When a parent process creates livekit objects (Room, tracks, audio streams, etc.), each creates FfiHandle instances. After fork(), the child inherits these objects. When the child exits, Python's GC invokes __del__ (livekit-rtc/livekit/rtc/_ffi_client.py:76-77) on these inherited handles, which calls dispose(), which calls livekit_ffi_drop_handle on the dead native runtime — the exact scenario the PR is designed to prevent.
Multiple callers use FfiHandle extensively: room.py:550, audio_stream.py:138, video_stream.py:70, track.py:35, participant.py:134, and many others.
| def dispose(self) -> None: | |
| if self.handle != INVALID_HANDLE and not self._disposed: | |
| self._disposed = True | |
| assert FfiClient.instance._ffi_lib.livekit_ffi_drop_handle(ctypes.c_uint64(self.handle)) | |
| dropped = FfiClient.instance._ffi_lib.livekit_ffi_drop_handle(ctypes.c_uint64(self.handle)) | |
| assert dropped | |
| def dispose(self) -> None: | |
| if self.handle != INVALID_HANDLE and not self._disposed: | |
| self._disposed = True | |
| if FfiClient.instance._pid != os.getpid(): | |
| return # native runtime does not survive fork() | |
| dropped = FfiClient.instance._ffi_lib.livekit_ffi_drop_handle(ctypes.c_uint64(self.handle)) | |
| assert dropped |
Was this helpful? React with 👍 or 👎 to provide feedback.
42a8d60 to
3a092f3
Compare
The native runtime (tokio worker/IO threads in liblivekit_ffi) does not survive fork() and cannot be rebuilt by re-initializing, so a child that inherited an initialized FfiClient would hang on every request. Guard request() with the owning pid and raise a clear error instead. The atexit dispose handler is likewise skipped in fork children (it needs the runtime and would hang at exit). Also move the drop_handle side effect out of an assert so it is not stripped under python -O.
3a092f3 to
389b573
Compare
|
oh, one question, is there any way to add a test for the fork behavior ? so that we can prevent regression |
The native runtime (tokio worker/IO threads in liblivekit_ffi) does not survive
fork(), and re-initializing only resets the event callback rather than rebuilding the runtime. A child process that inherited an already-initializedFfiClientsingleton therefore hangs on every FFI request.Guard
FfiClient.instancewith the owning pid and raise a clearRuntimeErrorinstead of hanging. This does not affectspawn/forkserver(each process initializes its own FFI); it only turns thefork-after-init footgun into an actionable error.