Bound socket writes with a timeout#99
Open
steviec wants to merge 2 commits into
Open
Conversation
Socket.setSoTimeout (exposed as socketTimeout) only bounds reads. The writes that begin every shell/pull/push -- the OPEN, plus streamed WRTE payloads -- had no deadline, so if adbd stops draining the socket (a wedged or half-open connection) a write blocks indefinitely: the read timeout never fires because it does not cover writes, and nothing closes the socket. Always bound writes via okio's socket sink timeout (10s, matching OkHttp's default). okio arms a SocketAsyncTimeout per chunk, so this is a per-progress stall deadline -- a healthy large transfer resets it every 64KB -- and on expiry it closes the socket, so the write throws SocketTimeoutException and the connection rebuilds on the next op. Applied unconditionally rather than exposed as a parameter: unlike connectTimeout/socketTimeout/keepAlive (pass-through Socket options that vary by caller), a write-stall bound has one correct value, so it is a safe-by-default guard, not a tunable. Public API is unchanged. Also fix list() passing connectTimeout/socketTimeout to create() in swapped order.
The previous WriteTimeoutTest wedged the connection by SIGSTOP-ing the emulator process. That is environment-dependent: on macOS it stalls cleanly, but on Linux CI the freeze surfaced as a connection reset and a race, so the test failed. Replace it with a transparent TCP relay between dadb and the real adbd on 5555: the relay forwards both directions untouched (so the real ADB handshake/protocol run against real adbd), and pausing a pump wedges one direction deterministically. A real adbd buffers hundreds of MB, but a relay that stops reading lets only the OS socket buffers absorb the write before it blocks -- so the write deadline fires on any platform, with no process freezing, no root, and no SIGSTOP. Add an internal-only writeTimeoutMillis seam (DadbImpl / AdbConnection. connect) so the test runs in ~1s instead of the 10s production deadline. It is not exposed on the public Dadb.create API.
| fun connect(socket: Socket, keyPair: AdbKeyPair? = null, writeTimeoutMillis: Long = WRITE_TIMEOUT_MILLIS): AdbConnection { | ||
| val source = socket.source() | ||
| val sink = socket.sink() | ||
| sink.timeout().timeout(writeTimeoutMillis, TimeUnit.MILLISECONDS) |
Contributor
There was a problem hiding this comment.
I like the approach overall, but I'm not sure socket writes are the only place dadb can get stuck, this specifically guards the socket write syscall, right? My review angle was "is there anything else inside dadb that can hang?", and two things stood out:
- No observability inside dadb. When a run wedges we can't see where, there are no logs around the last operation, so we're guessing. Worth adding some.
- There's a hang that neither the read nor the write timeout covers:
I dug into the dadb + maestro code with Claude and found it comes from two threads sharing one dadb connection. The driver tunnels its gRPC channel over the same dadb connection, so besides the consumer (test-runner) thread there's a gRPC reader thread doing keep-alive reads:
The potential hang, in MessageQueue.take():
- gRPC reader thread G holds readLock, parked in read() (idle, waiting for frames).
- Test-runner thread T runs the pull → take(sync, WRTE) → tryLock(readLock) fails (G has it) → T parks in await(), which has no timeout.
- adbd wedges. G's read() throws at 120s (SO_TIMEOUT): but signalAll() runs after read(), so on the exception it's skipped.
- Nobody wakes T. readLock is free, but T only retries once signaled. T parks forever, ended only by the 900s job watchdog.
T is never in read() or write(), so neither timeout fires. This points to needing a timeout on the await() here.
amanjeetsingh150
approved these changes
Jun 5, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Socket.setSoTimeout(exposed assocketTimeout) only bounds reads. The writes that begin everyshell/pull/push— theOPEN, plus streamedWRTEpayloads — had no deadline. If adbd stops draining the socket (a wedged or half-open connection), a write blocks indefinitely: the read timeout never fires because it doesn't cover writes, and nothing closes the socket. A caller's only recourse was its own external watchdog.Fix
Always bound writes via okio's socket sink timeout (
WRITE_TIMEOUT_MILLIS = 10s, matching OkHttp's default write timeout). okio arms aSocketAsyncTimeoutper chunk, so this is a per-progress stall deadline — a healthy large transfer resets it every 64KB, not a total-transfer cap — and on expiry it closes the socket, so the write throwsSocketTimeoutExceptionandDadbImplrebuilds the connection on the next op.It's applied unconditionally rather than exposed as a parameter: unlike
connectTimeout/socketTimeout/keepAlive(pass-throughSocketoptions that legitimately vary by caller), a write-stall bound has one correct value — "the socket is wedged" — so it's a safe-by-default correctness guard, not a tunable. Public API is unchanged.Also fixes a latent bug:
list()passedconnectTimeout/socketTimeouttocreate()in swapped order.Testing
Added a full-stack test (
WriteTimeoutTest) against a real emulator, inducing the wedge by freezing adbd (SIGSTOP): the write now fails fast withSocketTimeoutExceptionat the deadline and the connection self-heals on the next op; the read path is unchanged. Full suite green (58 tests).🤖 Generated with Claude Code