Wake takers parked in MessageQueue.take() when a read fails#100
Conversation
fb2908e to
8d4e081
Compare
A read that threw left parked takers stranded: signalAll() ran only after a successful read(), so a failing read (reset/EOF/timeout) never woke them and take() hung forever (e.g. a keep-alive reader thread plus a request thread sharing one connection). Fix: signal on every read() outcome via try/finally. A failed read releases readLock and wakes the waiters; each retries and observes the failure itself, so errors stay per-call and non-sticky — a transient read error no longer poisons the connection. close() is the one terminal case: it sets a close marker (signalClosed) and wakes every taker so they fail fast with a clean IOException instead of stranding in await() or waking to re-read the closed reader. The marker is set only on close (when the socket also closes), so it stays in step with DadbImpl's rebuild-on-next-op. Tests (deterministic, emulator-free): a read failure wakes every parked taker as IOException; a transient failure does not poison the queue (a later take() still delivers); close() wakes every parked taker with an IOException. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
8d4e081 to
daec36a
Compare
amanjeetsingh150
left a comment
There was a problem hiding this comment.
Just writing one concern a bigger question we can chat offline about: How should we get confidence on changes with dadb? e2e on maestro core should help here but we might want to do snapshot release and then do final one.
I had the exact same question as I was making these changes. The problem with this (and with maestro core) is lack of robust test coverage. If we already have a harness in place that we had confidence really tested the mechanics of this library, we could make changes rapidly with confidence. So the real fix, IMO, would be to actually rebuild this library from the ground up, driven by behavior tests that are rock solid. Tests and test harnesses are more important than code now IMO, so if you have code without tests, the code is potentially not worth keeping (and only using as guidance for "correct" behavior to the test harness that you're building) |
Problem
Two threads sharing one
MessageQueue(e.g. a keep-alive reader thread + a request thread over one ADB connection) could hang forever. Intake(),signalAll()ran only after a successfulread(). Whenread()threw (connection reset / EOF /SocketTimeoutException), any taker parked inqueueCond.await()was never woken — a classic lost wakeup. Neither the read nor the write timeout coversawait().Fix
read()outcome (try { read() } finally { signalAll() }). A failed read releasesreadLockand wakes the waiters; each woken taker retries the read and observes the failure for itself. Errors stay per-call and non-sticky — a transient read error no longer poisons the connection.close()is the one terminal case. It sets a close marker (signalClosed) and wakes every parked taker so they fail fast with a cleanIOException, instead of stranding inawait()or waking to re-read an already-closed reader (which would surface asIllegalStateException). The marker is set only on close — when the socket also closes — so it stays in step withDadbImplrebuilding the connection on the next op (no permanent bricking).This preserves the pre-existing failure contract (per-call, retryable
IOException; connection liveness owned by the socket/DadbImpl) and only closes the lost-wakeup hole.Tests
Deterministic and emulator-free:
IOException;take()still delivers a subsequently-read message;close()wakes every parked taker with a cleanIOException(exercised through the realAdbMessageQueue/AdbReaderpath).Existing happy-path concurrency tests (
MessageQueueTest) continue to pass.