Skip to content

fix: resolve concurrency bugs in eventbus and webtransport#3518

Open
Sahil-4555 wants to merge 4 commits into
libp2p:masterfrom
Sahil-4555:fix/concurrency-safety
Open

fix: resolve concurrency bugs in eventbus and webtransport#3518
Sahil-4555 wants to merge 4 commits into
libp2p:masterfrom
Sahil-4555:fix/concurrency-safety

Conversation

@Sahil-4555

Copy link
Copy Markdown
Contributor

Issue

observed two concurrency issues:

  1. Eventbus Deadlock: When multiple emitters concurrently broadcast events to a slow consumer (full queue), they can deadlock. Both the wildcard subscriber node and standard subscriber node shared a single slowConsumerTimer pointer. Concurrent emitters raced on Reset(), causing the timer to fire only once. One emitter consumed the tick and blocked on the channel, leaving other emitters permanently blocked on <-timer.C even after the queue was drained. This hang occurred under a read-lock (RLock), blocking any subscription closure (sub.Close()) or node updates from acquiring a write-lock.
  2. WebTransport Cert Manager Data Race: The SerializedCertHashes() getter read the shared m.serializedCertHashes slice without acquiring a lock. This caused a read/write data race with the background cert manager thread which updates the slice during certificate rotation. Returning the raw slice by reference also caused slice aliasing, exposing the caller to backing memory being concurrently mutated/resized.

Fix

  1. Eventbus Timer Isolation: Removed the shared slowConsumerTimer pointer field from node and wildcardNode. The slow consumer warning logic was refactored to use a local time.Timer allocated on the stack within emitAndLogError, eliminating shared timer state across concurrent emitters.
  2. WebTransport Thread-Safety: Added m.mx.RLock() to SerializedCertHashes() and returned a deep copy of the slice rather than a direct reference to avoid data races and slice aliasing.

Tests

  1. Eventbus: Added TestWildcardSlowConsumerDeadlock in p2p/host/eventbus/basic_test.go which simulates a full queue under concurrent emissions to verify that no deadlocks occur and the emitters safely proceed when the queue is drained.

Sahil-4555 and others added 4 commits June 16, 2026 11:30
Replace the wall-clock TestWildcardSlowConsumerDeadlock with a synctest
version that fires the warning timeout on a fake clock, so it runs
instantly and cannot flake on a loaded CI runner. Add a close-during-
stall test covering the safety valve where closing a subscription
releases an emit parked under the wildcard node read lock.
Add a test that the getter returns an independent copy, and a -race
test that hammers it during background certificate rotation. Both
fail on the unsynchronized getter and pass with the read lock and copy.

@lidel lidel left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. I'd been debugging the same bug myself, so I'm glad you beat me to it :)

I pushed two small test commits on top:

  • Moved the eventbus slow-consumer test over to synctest. It runs on a fake clock now, so it's fast and won't flake on CI the way the sleep-based version could.
  • Added two tests for SerializedCertHashes: one checks the getter returns a copy, the other runs it against cert rotation under -race.

@MarcoPolo fwiw lgtm, and Kubo could benefit from this, but your call

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants