Skip to content

Bound socket writes with a timeout#99

Open
steviec wants to merge 2 commits into
masterfrom
add-write-timeout
Open

Bound socket writes with a timeout#99
steviec wants to merge 2 commits into
masterfrom
add-write-timeout

Conversation

@steviec

@steviec steviec commented Jun 1, 2026

Copy link
Copy Markdown

Problem

Socket.setSoTimeout (exposed as socketTimeout) only bounds reads. The writes that begin every shell/pull/push — the OPEN, plus streamed WRTE payloads — had no deadline. If adbd stops draining the socket (a wedged or half-open connection), a write blocks indefinitely: the read timeout never fires because it doesn't cover writes, and nothing closes the socket. A caller's only recourse was its own external watchdog.

Fix

Always bound writes via okio's socket sink timeout (WRITE_TIMEOUT_MILLIS = 10s, matching OkHttp's default write timeout). okio arms a SocketAsyncTimeout per chunk, so this is a per-progress stall deadline — a healthy large transfer resets it every 64KB, not a total-transfer cap — and on expiry it closes the socket, so the write throws SocketTimeoutException and DadbImpl rebuilds the connection on the next op.

It's applied unconditionally rather than exposed as a parameter: unlike connectTimeout / socketTimeout / keepAlive (pass-through Socket options that legitimately vary by caller), a write-stall bound has one correct value — "the socket is wedged" — so it's a safe-by-default correctness guard, not a tunable. Public API is unchanged.

Also fixes a latent bug: list() passed connectTimeout / socketTimeout to create() in swapped order.

Testing

Added a full-stack test (WriteTimeoutTest) against a real emulator, inducing the wedge by freezing adbd (SIGSTOP): the write now fails fast with SocketTimeoutException at the deadline and the connection self-heals on the next op; the read path is unchanged. Full suite green (58 tests).

🤖 Generated with Claude Code

steviec added 2 commits June 1, 2026 15:39
Socket.setSoTimeout (exposed as socketTimeout) only bounds reads. The
writes that begin every shell/pull/push -- the OPEN, plus streamed WRTE
payloads -- had no deadline, so if adbd stops draining the socket (a
wedged or half-open connection) a write blocks indefinitely: the read
timeout never fires because it does not cover writes, and nothing closes
the socket.

Always bound writes via okio's socket sink timeout (10s, matching
OkHttp's default). okio arms a SocketAsyncTimeout per chunk, so this is a
per-progress stall deadline -- a healthy large transfer resets it every
64KB -- and on expiry it closes the socket, so the write throws
SocketTimeoutException and the connection rebuilds on the next op.

Applied unconditionally rather than exposed as a parameter: unlike
connectTimeout/socketTimeout/keepAlive (pass-through Socket options that
vary by caller), a write-stall bound has one correct value, so it is a
safe-by-default guard, not a tunable. Public API is unchanged.

Also fix list() passing connectTimeout/socketTimeout to create() in
swapped order.
The previous WriteTimeoutTest wedged the connection by SIGSTOP-ing the
emulator process. That is environment-dependent: on macOS it stalls
cleanly, but on Linux CI the freeze surfaced as a connection reset and a
race, so the test failed.

Replace it with a transparent TCP relay between dadb and the real adbd on
5555: the relay forwards both directions untouched (so the real ADB
handshake/protocol run against real adbd), and pausing a pump wedges one
direction deterministically. A real adbd buffers hundreds of MB, but a
relay that stops reading lets only the OS socket buffers absorb the write
before it blocks -- so the write deadline fires on any platform, with no
process freezing, no root, and no SIGSTOP.

Add an internal-only writeTimeoutMillis seam (DadbImpl / AdbConnection.
connect) so the test runs in ~1s instead of the 10s production deadline.
It is not exposed on the public Dadb.create API.
@steviec steviec closed this Jun 2, 2026
@steviec steviec reopened this Jun 2, 2026
fun connect(socket: Socket, keyPair: AdbKeyPair? = null, writeTimeoutMillis: Long = WRITE_TIMEOUT_MILLIS): AdbConnection {
val source = socket.source()
val sink = socket.sink()
sink.timeout().timeout(writeTimeoutMillis, TimeUnit.MILLISECONDS)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach overall, but I'm not sure socket writes are the only place dadb can get stuck, this specifically guards the socket write syscall, right? My review angle was "is there anything else inside dadb that can hang?", and two things stood out:

  1. No observability inside dadb. When a run wedges we can't see where, there are no logs around the last operation, so we're guessing. Worth adding some.
  2. There's a hang that neither the read nor the write timeout covers:

I dug into the dadb + maestro code with Claude and found it comes from two threads sharing one dadb connection. The driver tunnels its gRPC channel over the same dadb connection, so besides the consumer (test-runner) thread there's a gRPC reader thread doing keep-alive reads:

The potential hang, in MessageQueue.take():

  1. gRPC reader thread G holds readLock, parked in read() (idle, waiting for frames).
  2. Test-runner thread T runs the pull → take(sync, WRTE) → tryLock(readLock) fails (G has it) → T parks in await(), which has no timeout.
  3. adbd wedges. G's read() throws at 120s (SO_TIMEOUT): but signalAll() runs after read(), so on the exception it's skipped.
  4. Nobody wakes T. readLock is free, but T only retries once signaled. T parks forever, ended only by the 900s job watchdog.

T is never in read() or write(), so neither timeout fires. This points to needing a timeout on the await() here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants