Conversation

@the-mikedavis (Collaborator)

Shrinking a member node off of a QQ can be parallelized. The operation involves the following steps (sketched after this list):

* removing the node from the QQ's cluster membership (appending a command to the log and committing it) with `ra:remove_member/3`
* updating the metadata store to remove the member from the QQ type state with `rabbit_amqqueue:update/2`
* deleting the queue data from the node with `ra:force_delete_server/2` if the node can be reached
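A minimal sketch of those three steps for a single queue. The argument shapes, the 60s timeout, the `RaSystem` argument and the `remove_node_from_type_state/2` helper are assumptions for illustration, not the code in this PR:

```erlang
%% Sketch only: argument shapes, the timeout and the type-state helper
%% are assumptions, not this PR's code.
shrink_one(QName, Leader = {RaName, _LeaderNode}, Node, RaSystem) ->
    ServerId = {RaName, Node},
    %% 1. append and commit a cluster membership change removing the member
    _ = ra:remove_member(Leader, ServerId, 60000),
    %% 2. drop the node from the QQ's type state in the metadata store
    _ = rabbit_amqqueue:update(QName,
            %% hypothetical helper for the type-state update
            fun(Q) -> remove_node_from_type_state(Q, Node) end),
    %% 3. delete the member's data on that node, if it can be reached
    ra:force_delete_server(RaSystem, ServerId).
```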

All of these operations are I/O bound. Updating the cluster membership and the metadata store involves appending commands to those logs and replicating them. Writing commands to Ra synchronously and in serial is fairly slow; sending many commands in parallel is much more efficient. By parallelizing these steps we can write larger chunks of commands to the WAL(s).

`ra:force_delete_server/2` also benefits from parallelization when the node being shrunk off is no longer reachable, for example after a hardware failure. The underlying `rpc:call/4` attempts to auto-connect to the node, and that attempt can take some time to time out. With the calls running in parallel, each `rpc:call/4` reuses the same underlying distribution connection, so all of the calls fail together once that connection fails to establish.
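A rough sketch of the chunked fan-out/fan-in this describes. `shrink/2` stands in for the per-queue steps above, 64 matches the default mentioned below, and `in_chunks_of/2` is an illustrative helper rather than this PR's code:

```erlang
%% Rough sketch of the fan-out/fan-in; shrink/2 stands in for the
%% per-queue steps above and in_chunks_of/2 is an illustrative helper.
shrink_in_chunks(Node, Qs) ->
    Parent = self(),
    lists:flatmap(
      fun(Chunk) ->
              Pids = [spawn(fun() ->
                                    Res = shrink(Node, Q),
                                    Parent ! {self(), Res}
                            end) || Q <- Chunk],
              %% gather one reply per worker, in spawn order
              [receive {Pid, Res} -> Res end || Pid <- Pids]
      end, in_chunks_of(64, Qs)).

%% split a list into sublists of at most N elements
in_chunks_of(_N, []) -> [];
in_chunks_of(N, L) when length(L) =< N -> [L];
in_chunks_of(N, L) -> {H, T} = lists:split(N, L), [H | in_chunks_of(N, T)].
```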

Discussed in #15003

@the-mikedavis (Collaborator, Author)

With this change and the default of 64 set here (just a sensible-seeming constant), I see my test from #15003 of shrinking 1000 QQs go from taking ~2 hrs to taking 1 min 52 sec.

%% one worker per queue in the chunk; the parent gathers the results
Pids = [spawn(fun() ->
                  Res = shrink(Node, Q),
                  Parent ! {self(), Res}
              end) || Q <- Chunk],
[receive
     {Pid, Res} -> Res
 end || Pid <- Pids]
Collaborator

We should introduce a timeout here. All timeout defaults are wrong, but something like 10s or 15s won't suffer much from false positives and avoids blocking the parent process.
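A sketch of one way to do that, reusing the `Pids`/`{Pid, Res}` protocol from the hunk above; the 15s value and the `{error, timeout}` return are placeholders:

```erlang
[receive
     {Pid, Res} -> Res
 after 15000 ->
         %% stop waiting for this worker instead of blocking the parent
         {error, timeout}
 end || Pid <- Pids]
```

Note the budget is per worker, so in the worst case a chunk can still wait roughly `length(Pids)` times the timeout.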

Collaborator Author

Yeah, I wonder if we should also `spawn_monitor` and look for `'DOWN'`s as well. I haven't seen this code crash before, but it wouldn't hurt to catch exits and avoid this list comprehension getting stuck.
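Something along these lines, combining the monitor with the timeout idea above; it assumes the `Parent`, `Node` and `Chunk` bindings from the surrounding code, and the error terms are illustrative:

```erlang
Workers = [spawn_monitor(fun() ->
                                 Parent ! {self(), shrink(Node, Q)}
                         end) || Q <- Chunk],
[receive
     {Pid, Res} ->
         erlang:demonitor(MRef, [flush]),
         Res;
     {'DOWN', MRef, process, Pid, Reason} ->
         %% the worker crashed before replying
         {error, Reason}
 after 15000 ->
         erlang:demonitor(MRef, [flush]),
         {error, timeout}
 end || {Pid, MRef} <- Workers]
```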

Collaborator

That would be even better. Both are just typical defensive techniques.
