P0: fix flusher shutdown deadlock; use notify_one for all workers by pbudzik · Pull Request #26 · pbudzik/usagedb

pbudzik · 2026-05-17T06:25:17Z

Summary

Fixes the P0 shutdown deadlock flagged in the latest external review.

The FlusherWorker holds an Arc<AppState> clone, and AppState owns a flush_sender. The flusher's run loop was a plain while let Some(msg) = recv.await { ... } that exits only when all senders drop — but the flusher's own state clone keeps one sender alive forever. drop(state); flusher_handle.await in main.rs therefore hangs on graceful shutdown.

Changes

FlusherWorker::run now takes an Arc<Notify> shutdown signal and uses tokio::select!, matching the rollup/compaction pattern. On signal it drains any already-queued messages (the shutdown-drain payload and any last-tick flushes) before exiting.
main.rs shutdown is reordered: stop rollup + compaction (producers) and await them, then signal the flusher and await it, then drop state.
All three workers' shutdown call sites switched from notify_waiters() to notify_one(). notify_waiters is lossy if no waiter is currently registered — if a worker is mid-handle_message/mid-tick when shutdown fires, the notification disappears and the deadlock returns. notify_one stores a permit that the next notified().await consumes.

Test plan

New tests/p0_shutdown.rs::flusher_shuts_down_even_with_state_clones_outstanding reproduces the original deadlock (outstanding AppState clone present at shutdown) and confirms exit within 5s.
New tests/p0_shutdown.rs::flusher_drains_queued_messages_on_shutdown confirms drain-on-shutdown still processes queued messages (the shutdown-drain payload depends on this).
Full cargo test passes — 143 tests across the suite.

Notes

This is the first of several P0 fixes from the external review. Follow-ups (in order of priority):

ManifestStore::commit(|m| …) helper — fix mutate-then-save atomicity in flusher / rollup / compaction / period-close
Watermark race-safe commit (validate last_sealed_wal_id and raw-segment generation are unchanged before saving)
Open → Closing → Closed period lifecycle so close-period can't miss racing ingests
Dedupe TTL on ingested_at_ms, not event timestamp_ms

🤖 Generated with Claude Code

The FlusherWorker held an Arc<AppState> clone, which owns a flush_sender. Its run loop exited only when all senders dropped, so graceful shutdown in main.rs (drop(state); flusher_handle.await) would hang forever — the flusher's own clone kept the sender alive. Fix: give the flusher an explicit Arc<Notify> shutdown signal matching the rollup/compaction workers, with a tokio::select. On signal, drain any messages already queued (the shutdown-drain payload and any last-tick flushes from rollup/compaction) before exiting. Also switched all three workers' shutdown call sites from notify_waiters() to notify_one(). notify_waiters is lossy if no task is currently registered as a waiter; if a worker is mid-handle_message or mid-tick when shutdown fires, the notification can disappear and the deadlock reappears. notify_one stores a permit that the next notified().await consumes. Shutdown order reworked so producers stop before the flusher drains: 1. rollup_shutdown / compaction_shutdown 2. await rollup_handle / compaction_handle 3. flusher_shutdown 4. await flusher_handle 5. drop(state) Regression test in tests/p0_shutdown.rs reproduces the deadlock with an outstanding AppState clone and confirms the flusher exits within 5s of shutdown plus drains pending messages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pbudzik mentioned this pull request May 17, 2026

P0: atomic manifest commit helper — clone-mutate-save-publish #27

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

P0: fix flusher shutdown deadlock; use notify_one for all workers#26

P0: fix flusher shutdown deadlock; use notify_one for all workers#26
pbudzik wants to merge 1 commit into
mainfrom
fix/flusher-shutdown

pbudzik commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pbudzik commented May 17, 2026

Summary

Changes

Test plan

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant