Skip to content

fix: prevent stream localId collisions from destroying live stream state#102

Merged
pedro18x merged 1 commit into
masterfrom
fix/local-id-collision
Jun 12, 2026
Merged

fix: prevent stream localId collisions from destroying live stream state#102
pedro18x merged 1 commit into
masterfrom
fix/local-id-collision

Conversation

@pedro18x

@pedro18x pedro18x commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

What this looks like in production

DuckDuckGo (and 20+ other orgs over the last 14 days, internal ref: MA-4060) run Maestro flows with androidWebViewHierarchy: devtools, which fetches WebView hierarchies over Chrome DevTools using short-lived streams (HTTP + websocket via OkHttp) on the same Dadb connection the driver uses for taps. Their flows fail on native taps that already matched: the tap is issued, and the command then dies with:

CommandFailed: Not listening for localId: 1844909876

For the affected flows this failed 4 out of 4 attempts across 4 different hosts. The exception escapes the devtools client's catch (e: IOException) (it is an IllegalStateException) and fails the customer's command.

Root cause

MessageQueue keys all stream state on the raw localId int, and stopListening(localId) removes it without holding any lock. There are two ways a live stream's state gets destroyed:

1. Close-during-read race (no collision needed, the frequent path). AdbStreamImpl.nextMessage already handles "stream closed while reading" gracefully:

private fun nextMessage(command: Int): AdbMessage? {
    return try {
        messageQueue.take(localId, command)
    } catch (e: IOException) {   // AdbStreamClosed is an IOException -> clean EOF
        close()
        return null
    }
}

But if another thread calls close() (-> stopListening) while a reader is blocked inside take(), the next poll() finds queues[localId] == null and throws IllegalStateException("Not listening for localId") instead of AdbStreamClosed. Not an IOException, so it escapes the catch above and every caller's IO error handling. OkHttp closes sockets from watchdog/cancellation/pool-eviction threads while its reader threads are blocked, so the devtools path hits this interleaving routinely.

2. localId collision (rare, the latent design flaw). AdbConnection.newId() is random.nextInt() with no uniqueness check. If two concurrently-open streams draw the same ID, startListening silently no-ops (putIfAbsent) and both streams share one state entry; the first to close wipes the survivor's state and its next take() throws the same fatal exception.

Changes

  1. Sequential stream IDs (AtomicInteger), like AOSP's adb client. Collisions within a connection become impossible.
  2. startListening fails fast on a duplicate ID instead of silently sharing one state entry between two streams.
  3. A missing queue entry in poll() now throws AdbStreamClosed instead of IllegalStateException. This makes the close-during-read race resolve through the existing IOException handling as a normal end-of-stream, which is what "the stream was closed under you" should mean.

Testing

The new tests in MessageQueueTest fail on master and pass with this change. The full MessageQueueTest suite (including the existing concurrency stress tests) passes. Device-dependent tests (DadbTest) were not run locally and rely on CI.

AdbConnection allocated stream IDs with random.nextInt() and never checked
them against streams already open on the connection. MessageQueue keys all
stream state on that raw ID, and stopListening removes it, so when two
concurrently-open streams collided on an ID, the first one to close wiped
the survivor's queues entry and its next take() failed with the fatal
IllegalStateException("Not listening for localId") even though it was
never closed.

- Allocate IDs sequentially (AtomicInteger), like AOSP's adb client, so
  collisions within a connection are impossible.
- startListening now fails fast on a duplicate ID instead of silently
  sharing state between two streams.
- A missing queues entry in poll() now surfaces as the graceful
  AdbStreamClosed (which callers already handle) instead of
  IllegalStateException.
@linear-code

linear-code Bot commented Jun 9, 2026

Copy link
Copy Markdown

MA-4060

@Leland-Takamine

Copy link
Copy Markdown
Contributor

So when two concurrently-open streams collide on an ID

What are the chances of this actually happening?

@pedro18x

pedro18x commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

A pure random collision is rare. With k streams open, each new open() collides with probability k/2^32, so even a flow doing thousands of opens with a handful of live streams is in the one-in-millions range per flow. That alone can't explain what we see in production (the affected DuckDuckGo flows fail 4/4).

Digging further into the source while writing this up: the frequent path to this exact exception doesn't need a collision at all. stopListening holds no lock, so if one thread calls close() on a stream while another thread is blocked in take() on that same stream, the next poll() finds the queues entry gone and throws the fatal IllegalStateException instead of AdbStreamClosed. nextMessage only catches IOException, so instead of the clean EOF you'd get from AdbStreamClosed, the ISE escapes into the application. The devtools WebView path produces exactly that interleaving all the time, because OkHttp closes sockets from its watchdog/cancel/pool threads while its reader threads are blocked mid-read. That's the mechanism behind the production frequency; the random-ID collision is the rarer latent flaw on top.

I've updated the PR description to lay out both paths. The change covers both: sequential IDs make collisions impossible, and converting the missing-entry case to AdbStreamClosed makes the close-during-read race resolve as a normal end-of-stream through the IOException handling that's already there.

(Pedro and Claudro thought through this)

@steviec steviec left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super clean, nice

@pedro18x pedro18x merged commit ff172a2 into master Jun 12, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants