Skip to content

Wake takers parked in MessageQueue.take() when a read fails#100

Open
steviec wants to merge 1 commit into
masterfrom
fix-message-queue-lost-wakeup
Open

Wake takers parked in MessageQueue.take() when a read fails#100
steviec wants to merge 1 commit into
masterfrom
fix-message-queue-lost-wakeup

Conversation

@steviec

@steviec steviec commented Jun 3, 2026

Copy link
Copy Markdown

Problem

Two threads sharing one MessageQueue (e.g. a keep-alive reader thread + a request thread over one ADB connection) could hang forever. In take(), signalAll() ran only after a successful read(). When read() threw (connection reset / EOF / SocketTimeoutException), any taker parked in queueCond.await() was never woken — a classic lost wakeup. Neither the read nor the write timeout covers await().

Fix

  • Signal on every read() outcome (try { read() } finally { signalAll() }). A failed read releases readLock and wakes the waiters; each woken taker retries the read and observes the failure for itself. Errors stay per-call and non-sticky — a transient read error no longer poisons the connection.
  • close() is the one terminal case. It sets a close marker (signalClosed) and wakes every parked taker so they fail fast with a clean IOException, instead of stranding in await() or waking to re-read an already-closed reader (which would surface as IllegalStateException). The marker is set only on close — when the socket also closes — so it stays in step with DadbImpl rebuilding the connection on the next op (no permanent bricking).

This preserves the pre-existing failure contract (per-call, retryable IOException; connection liveness owned by the socket/DadbImpl) and only closes the lost-wakeup hole.

Tests

Deterministic and emulator-free:

  • a read failure wakes every parked taker, each as IOException;
  • a transient read failure does not poison the queue — a later take() still delivers a subsequently-read message;
  • close() wakes every parked taker with a clean IOException (exercised through the real AdbMessageQueue/AdbReader path).

Existing happy-path concurrency tests (MessageQueueTest) continue to pass.

@steviec steviec force-pushed the fix-message-queue-lost-wakeup branch from fb2908e to 8d4e081 Compare June 3, 2026 18:57
A read that threw left parked takers stranded: signalAll() ran only after a
successful read(), so a failing read (reset/EOF/timeout) never woke them and
take() hung forever (e.g. a keep-alive reader thread plus a request thread
sharing one connection).

Fix: signal on every read() outcome via try/finally. A failed read releases
readLock and wakes the waiters; each retries and observes the failure itself,
so errors stay per-call and non-sticky — a transient read error no longer
poisons the connection.

close() is the one terminal case: it sets a close marker (signalClosed) and
wakes every taker so they fail fast with a clean IOException instead of
stranding in await() or waking to re-read the closed reader. The marker is set
only on close (when the socket also closes), so it stays in step with
DadbImpl's rebuild-on-next-op.

Tests (deterministic, emulator-free): a read failure wakes every parked taker
as IOException; a transient failure does not poison the queue (a later take()
still delivers); close() wakes every parked taker with an IOException.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@steviec steviec force-pushed the fix-message-queue-lost-wakeup branch from 8d4e081 to daec36a Compare June 3, 2026 19:58

@amanjeetsingh150 amanjeetsingh150 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just writing one concern a bigger question we can chat offline about: How should we get confidence on changes with dadb? e2e on maestro core should help here but we might want to do snapshot release and then do final one.

@steviec

steviec commented Jun 5, 2026

Copy link
Copy Markdown
Author

Just writing one concern a bigger question we can chat offline about: How should we get confidence on changes with dadb? e2e on maestro core should help here but we might want to do snapshot release and then do final one.

I had the exact same question as I was making these changes. The problem with this (and with maestro core) is lack of robust test coverage. If we already have a harness in place that we had confidence really tested the mechanics of this library, we could make changes rapidly with confidence. So the real fix, IMO, would be to actually rebuild this library from the ground up, driven by behavior tests that are rock solid. Tests and test harnesses are more important than code now IMO, so if you have code without tests, the code is potentially not worth keeping (and only using as guidance for "correct" behavior to the test harness that you're building)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants