Skip to content

P0: fix flusher shutdown deadlock; use notify_one for all workers#26

Open
pbudzik wants to merge 1 commit into
mainfrom
fix/flusher-shutdown
Open

P0: fix flusher shutdown deadlock; use notify_one for all workers#26
pbudzik wants to merge 1 commit into
mainfrom
fix/flusher-shutdown

Conversation

@pbudzik

@pbudzik pbudzik commented May 17, 2026

Copy link
Copy Markdown
Owner

Summary

Fixes the P0 shutdown deadlock flagged in the latest external review.

The FlusherWorker holds an Arc<AppState> clone, and AppState owns a flush_sender. The flusher's run loop was a plain while let Some(msg) = recv.await { ... } that exits only when all senders drop — but the flusher's own state clone keeps one sender alive forever. drop(state); flusher_handle.await in main.rs therefore hangs on graceful shutdown.

Changes

  • FlusherWorker::run now takes an Arc<Notify> shutdown signal and uses tokio::select!, matching the rollup/compaction pattern. On signal it drains any already-queued messages (the shutdown-drain payload and any last-tick flushes) before exiting.
  • main.rs shutdown is reordered: stop rollup + compaction (producers) and await them, then signal the flusher and await it, then drop state.
  • All three workers' shutdown call sites switched from notify_waiters() to notify_one(). notify_waiters is lossy if no waiter is currently registered — if a worker is mid-handle_message/mid-tick when shutdown fires, the notification disappears and the deadlock returns. notify_one stores a permit that the next notified().await consumes.

Test plan

  • New tests/p0_shutdown.rs::flusher_shuts_down_even_with_state_clones_outstanding reproduces the original deadlock (outstanding AppState clone present at shutdown) and confirms exit within 5s.
  • New tests/p0_shutdown.rs::flusher_drains_queued_messages_on_shutdown confirms drain-on-shutdown still processes queued messages (the shutdown-drain payload depends on this).
  • Full cargo test passes — 143 tests across the suite.

Notes

This is the first of several P0 fixes from the external review. Follow-ups (in order of priority):

  1. ManifestStore::commit(|m| …) helper — fix mutate-then-save atomicity in flusher / rollup / compaction / period-close
  2. Watermark race-safe commit (validate last_sealed_wal_id and raw-segment generation are unchanged before saving)
  3. Open → Closing → Closed period lifecycle so close-period can't miss racing ingests
  4. Dedupe TTL on ingested_at_ms, not event timestamp_ms

🤖 Generated with Claude Code

The FlusherWorker held an Arc<AppState> clone, which owns a
flush_sender. Its run loop exited only when all senders dropped, so
graceful shutdown in main.rs (drop(state); flusher_handle.await) would
hang forever — the flusher's own clone kept the sender alive.

Fix: give the flusher an explicit Arc<Notify> shutdown signal matching
the rollup/compaction workers, with a tokio::select. On signal, drain
any messages already queued (the shutdown-drain payload and any
last-tick flushes from rollup/compaction) before exiting.

Also switched all three workers' shutdown call sites from
notify_waiters() to notify_one(). notify_waiters is lossy if no task
is currently registered as a waiter; if a worker is mid-handle_message
or mid-tick when shutdown fires, the notification can disappear and
the deadlock reappears. notify_one stores a permit that the next
notified().await consumes.

Shutdown order reworked so producers stop before the flusher drains:
  1. rollup_shutdown / compaction_shutdown
  2. await rollup_handle / compaction_handle
  3. flusher_shutdown
  4. await flusher_handle
  5. drop(state)

Regression test in tests/p0_shutdown.rs reproduces the deadlock with
an outstanding AppState clone and confirms the flusher exits within
5s of shutdown plus drains pending messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant