P0: fix flusher shutdown deadlock; use notify_one for all workers#26
Open
pbudzik wants to merge 1 commit into
Open
P0: fix flusher shutdown deadlock; use notify_one for all workers#26pbudzik wants to merge 1 commit into
pbudzik wants to merge 1 commit into
Conversation
The FlusherWorker held an Arc<AppState> clone, which owns a flush_sender. Its run loop exited only when all senders dropped, so graceful shutdown in main.rs (drop(state); flusher_handle.await) would hang forever — the flusher's own clone kept the sender alive. Fix: give the flusher an explicit Arc<Notify> shutdown signal matching the rollup/compaction workers, with a tokio::select. On signal, drain any messages already queued (the shutdown-drain payload and any last-tick flushes from rollup/compaction) before exiting. Also switched all three workers' shutdown call sites from notify_waiters() to notify_one(). notify_waiters is lossy if no task is currently registered as a waiter; if a worker is mid-handle_message or mid-tick when shutdown fires, the notification can disappear and the deadlock reappears. notify_one stores a permit that the next notified().await consumes. Shutdown order reworked so producers stop before the flusher drains: 1. rollup_shutdown / compaction_shutdown 2. await rollup_handle / compaction_handle 3. flusher_shutdown 4. await flusher_handle 5. drop(state) Regression test in tests/p0_shutdown.rs reproduces the deadlock with an outstanding AppState clone and confirms the flusher exits within 5s of shutdown plus drains pending messages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the P0 shutdown deadlock flagged in the latest external review.
The
FlusherWorkerholds anArc<AppState>clone, andAppStateowns aflush_sender. The flusher's run loop was a plainwhile let Some(msg) = recv.await { ... }that exits only when all senders drop — but the flusher's own state clone keeps one sender alive forever.drop(state); flusher_handle.awaitinmain.rstherefore hangs on graceful shutdown.Changes
FlusherWorker::runnow takes anArc<Notify>shutdown signal and usestokio::select!, matching the rollup/compaction pattern. On signal it drains any already-queued messages (the shutdown-drain payload and any last-tick flushes) before exiting.main.rsshutdown is reordered: stop rollup + compaction (producers) and await them, then signal the flusher and await it, then drop state.notify_waiters()tonotify_one().notify_waitersis lossy if no waiter is currently registered — if a worker is mid-handle_message/mid-tick when shutdown fires, the notification disappears and the deadlock returns.notify_onestores a permit that the nextnotified().awaitconsumes.Test plan
tests/p0_shutdown.rs::flusher_shuts_down_even_with_state_clones_outstandingreproduces the original deadlock (outstandingAppStateclone present at shutdown) and confirms exit within 5s.tests/p0_shutdown.rs::flusher_drains_queued_messages_on_shutdownconfirms drain-on-shutdown still processes queued messages (the shutdown-drain payload depends on this).cargo testpasses — 143 tests across the suite.Notes
This is the first of several P0 fixes from the external review. Follow-ups (in order of priority):
ManifestStore::commit(|m| …)helper — fix mutate-then-save atomicity in flusher / rollup / compaction / period-closelast_sealed_wal_idand raw-segment generation are unchanged before saving)Open → Closing → Closedperiod lifecycle so close-period can't miss racing ingestsingested_at_ms, not eventtimestamp_ms🤖 Generated with Claude Code