Close device socket in dispatcherClean to unblock EventReader thread #105

Draft
alicespetma-stack wants to merge 1 commit into luxonis:master from alicespetma-stack:fix/tcp-recv-timeout

alicespetma-stack commented Mar 12, 2026

Summary

When DispatcherWaitEventComplete times out (e.g. after the timeout fixes in PR #104), DispatcherClean is called to clean up the dispatcher state. However, the EventReader thread remains stuck in recv() because the TCP socket is never closed during cleanup. This leaks a thread and socket for each failed connection attempt, preventing clean retry.

Problem

The cleanup flow today:

  1. DispatcherWaitEventComplete times out, returns error
  2. Caller invokes DispatcherClean
  3. dispatcherClean sets resetXLink = 1 and destroys semaphores
  4. The EventReader thread checks resetXLink only at the top of its loop
  5. The thread never reaches that check because it is blocked in recv() indefinitely

The EventReader's recv() call (in tcpipPlatformRead) has no SO_RCVTIMEO, so it blocks until data arrives or the connection breaks. If the remote device is unresponsive, neither happens.

Fix

Close the device socket at the start of dispatcherClean, before cleaning up events and semaphores. The shutdown(SHUT_RDWR) inside closeDeviceFd causes recv() to return an error immediately, allowing the EventReader thread to exit its loop.
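The effect described here can be reproduced outside XLink. The following is a minimal sketch, not the actual dispatcher code: `reader_ctx`, `reader_loop`, and `shutdown_unblocks_reader` are hypothetical names, and a Unix-domain `socketpair` stands in for the TCP connection on the assumption that `shutdown` wakes a blocked `recv()` the same way on both (which is the behavior on Linux). It first shows that setting a flag alone leaves the reader blocked, then that `shutdown(SHUT_RDWR)` lets it exit.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for dispatcher state; not the XLink structures. */
typedef struct {
    int fd;             /* socket the reader blocks on */
    atomic_int reset;   /* plays the role of resetXLink */
    atomic_int exited;  /* set by the reader just before it returns */
} reader_ctx;

static void *reader_loop(void *arg) {
    reader_ctx *ctx = arg;
    char buf[64];
    while (!atomic_load(&ctx->reset)) {                 /* flag is only checked between reads */
        ssize_t n = recv(ctx->fd, buf, sizeof buf, 0);  /* no timeout: can block indefinitely */
        if (n <= 0) break;                              /* EOF or error: leave the loop */
    }
    atomic_store(&ctx->exited, 1);
    return NULL;
}

static void sleep_ms(long ms) {
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Returns 1 if the reader ignored the flag while blocked but exited after
   shutdown(); 0 otherwise; -1 if the sockets or thread could not be created. */
int shutdown_unblocks_reader(void) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;

    reader_ctx ctx;
    ctx.fd = sv[0];
    atomic_init(&ctx.reset, 0);
    atomic_init(&ctx.exited, 0);

    pthread_t tid;
    if (pthread_create(&tid, NULL, reader_loop, &ctx) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    sleep_ms(100);                           /* give the reader time to enter recv() */
    atomic_store(&ctx.reset, 1);             /* set the flag, as cleanup does today */
    sleep_ms(200);
    int stuck = !atomic_load(&ctx.exited);   /* still blocked: the flag was never re-read */

    shutdown(sv[0], SHUT_RDWR);              /* the proposed fix: wake the blocked recv() */
    pthread_join(tid, NULL);
    int exited = atomic_load(&ctx.exited);

    close(sv[0]);
    close(sv[1]);
    return stuck && exited;
}
```

In this sketch `recv()` returns 0 (end of stream) after the shutdown rather than -1, so the loop treats any non-positive result as a reason to stop. Calling `shutdown_unblocks_reader()` from a test harness and checking for a return value of 1 is enough to exercise both halves of the behavior.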

The dispatcherDeviceFdDown flag prevents double-close when dispatcherReset (which also closes the FD) calls dispatcherClean. The flag check is inlined rather than calling dispatcherDeviceFdDown() to avoid deadlock on the non-recursive reset_mutex, which dispatcherReset already holds when it calls us.
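The locking concern can be illustrated with a small sketch. Everything here is hypothetical (`link_state`, `close_fd_once`, `link_clean`, `link_reset`, and the `close_count` counter exist only for this example) and it mirrors the shape of the argument, not the code in XLinkDispatcher.c: the reset path holds a default, non-recursive mutex and then calls the cleanup path, so cleanup must read the flag directly instead of going through a helper that would lock the same mutex again.

```c
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical per-link state; field names are illustrative only. */
typedef struct {
    pthread_mutex_t reset_mutex;  /* default mutex type, i.e. non-recursive */
    int fd;                       /* device socket */
    int fd_down;                  /* 1 once the socket has been closed */
    int close_count;              /* instrumentation for this example only */
} link_state;

int link_state_init(link_state *s, int fd) {
    s->fd = fd;
    s->fd_down = 0;
    s->close_count = 0;
    return pthread_mutex_init(&s->reset_mutex, NULL);
}

/* Closes the socket at most once. Takes no lock, so it is safe to call
   while reset_mutex is already held by the current thread. */
static void close_fd_once(link_state *s) {
    if (s->fd_down) return;       /* inlined flag check */
    shutdown(s->fd, SHUT_RDWR);   /* wake any thread blocked in recv() */
    close(s->fd);
    s->fd_down = 1;
    s->close_count++;
}

/* Cleanup path: reached directly after a timeout, or from link_reset(). */
void link_clean(link_state *s) {
    close_fd_once(s);
    /* event and semaphore cleanup would follow here */
}

/* Reset path: holds reset_mutex, closes the FD, then runs cleanup. */
void link_reset(link_state *s) {
    pthread_mutex_lock(&s->reset_mutex);
    close_fd_once(s);
    link_clean(s);  /* would self-deadlock if link_clean locked reset_mutex */
    pthread_mutex_unlock(&s->reset_mutex);
}
```

One trade-off worth noting: the unlocked read of `fd_down` in the direct cleanup path is only safe if callers are otherwise serialized. The sketch assumes they are; it does not establish that for the real dispatcher.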

Why not SO_RCVTIMEO?

SO_RCVTIMEO was considered but would affect all recv() calls on the socket, including healthy connections where large data transfers may legitimately take a long time. Closing the socket on cleanup is more surgical: it only interrupts recv() when we already know the connection is dead (a timeout has occurred).
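For comparison, this is roughly what the rejected alternative looks like on POSIX systems. It is a sketch under assumptions: `set_recv_timeout` and `idle_peer_triggers_timeout` are hypothetical helpers, a Unix-domain `socketpair` stands in for the TCP connection, and the Windows variant of the option (which takes a millisecond `DWORD` instead of a `struct timeval`) is not covered. The point is that the timeout is a property of the socket, so a connected but momentarily idle peer trips it just like a dead one.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Applies a receive timeout to every subsequent recv() on fd. */
int set_recv_timeout(int fd, long ms) {
    struct timeval tv;
    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
}

/* Returns 1 if recv() on a connected but idle socket fails with
   EAGAIN/EWOULDBLOCK once the timeout elapses; 0 otherwise; -1 on setup failure. */
int idle_peer_triggers_timeout(long ms) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;
    if (set_recv_timeout(sv[0], ms) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    char buf[16];
    ssize_t n = recv(sv[0], buf, sizeof buf, 0);  /* peer is alive but sends nothing */
    int timed_out = (n == -1) && (errno == EAGAIN || errno == EWOULDBLOCK);

    close(sv[0]);
    close(sv[1]);
    return timed_out;
}
```

A reader loop built on this would need to tell "timed out, keep waiting" apart from "connection is dead" on every call, which is the extra burden the close-on-cleanup approach avoids.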

Files Modified

| File | Change |
| --- | --- |
| src/shared/XLinkDispatcher.c | Close device socket early in dispatcherClean |

Companion PR

This works together with PR #104 (configurable timeouts for XLinkConnect and XLinkOpenStream). PR #104 enables timeout-based error return from the main thread; this PR ensures the EventReader thread also cleans up.

Test Plan

  • Build on Linux and Windows
  • After XLinkConnect timeout, verify EventReader thread exits (no leaked threads)
  • After timeout + cleanup, verify subsequent XLinkConnect retry succeeds on a healthy device
  • Normal device operation unaffected (socket only closed during cleanup, not during active use)
  • dispatcherReset path still works correctly (no double-close)

When DispatcherWaitEventComplete times out and DispatcherClean is called,
the EventReader thread remains stuck in recv() because the TCP socket is
still open. This leaks a thread and socket for each failed connection
attempt, and prevents clean retry.

Fix: close the device socket at the start of dispatcherClean, before
cleaning up events and semaphores. The shutdown(SHUT_RDWR) inside
closeDeviceFd causes recv() to return an error, letting the EventReader
exit its loop via the resetXLink flag.

The dispatcherDeviceFdDown flag prevents double-close when dispatcherReset
(which also closes the FD) calls dispatcherClean. We inline the flag check
rather than calling dispatcherDeviceFdDown() to avoid deadlock on the
non-recursive reset_mutex.
@themarpe themarpe requested a review from moratom March 13, 2026 09:31