Skip to content

rpc: add DRPC stream multiplexing behind an env gate#170248

Closed
shubhamdhama wants to merge 4 commits into
cockroachdb:masterfrom
shubhamdhama:add-stream-multiplexing
Closed

rpc: add DRPC stream multiplexing behind an env gate#170248
shubhamdhama wants to merge 4 commits into
cockroachdb:masterfrom
shubhamdhama:add-stream-multiplexing

Conversation

@shubhamdhama

Copy link
Copy Markdown
Contributor

Bumps DRPC to pick up stream multiplexing from cockroachdb/drpc#58,
where a single transport connection can carry multiple concurrent
streams.

The multiplexing dial path is guarded by
COCKROACH_EXPERIMENTAL_DRPC_MUX_ENABLED, defaulted to false. With the
gate off, DialDRPC keeps using drpcpool, which checks out a connection
per active stream and dials a new one when none is idle. The upgraded
multiplexing-capable library is therefore exercised on every connection,
but the multiplexing capability itself stays unused: every stream still
ends up on its own underlying connection, matching the prior behavior.

This keeps the initial scope small. The dial-mux path is in place so
it can be turned on for validation, with a follow-up to flip the
default once we are confident.

Both paths now share the same dial-option setup, so the mux path also
picks up the client metrics and the request-recording gate that the
pool path was already attaching.

Release note: None
Epic: none

Adds a small helper that bumps the DRPC replace directive in go.mod to
a specific commit. It takes a GitHub commit URL (cockroachdb or
shubhamdhama fork), resolves the pseudo-version with `go list`,
rewrites the replace line, runs `go mod tidy`, and regenerates the
Bazel files via `./dev generate bazel --mirror`.

Bumping DRPC was a multi-step ritual; the script removes the chance
of forgetting one of the steps.

Release note: None
Epic: none
@shubhamdhama shubhamdhama requested a review from a team May 13, 2026 08:41
@shubhamdhama shubhamdhama requested review from a team as code owners May 13, 2026 08:41
@shubhamdhama shubhamdhama added the do-not-merge bors won't merge a PR with this label. label May 13, 2026
@trunk-io

trunk-io Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@cockroach-teamcity

Copy link
Copy Markdown
Member

This change is Reviewable

@shubhamdhama

shubhamdhama commented May 13, 2026

Copy link
Copy Markdown
Contributor Author

🚨 Remove do-not-merge label checklist

  1. remove do-not-merge commits
  2. remove personal DRPC fork SHA with cockroach fork SHA

Bumps DRPC to pick up stream multiplexing from cockroachdb/drpc#58,
where a single transport connection can carry multiple concurrent
streams.

The multiplexing dial path is guarded by
COCKROACH_EXPERIMENTAL_DRPC_MUX_ENABLED, defaulted to false. With the
gate off, DialDRPC keeps using drpcpool, which checks out a connection
per active stream and dials a new one when none is idle. The upgraded
multiplexing-capable library is therefore exercised on every connection,
but the multiplexing capability itself stays unused: every stream still
ends up on its own underlying connection, matching the prior behavior.

This keeps the initial scope small. The dial-mux path is in place so
it can be turned on for validation, with a follow-up to flip the
default once we are confident.

Both paths now share the same dial-option setup, so the mux path also
picks up the client metrics and the request-recording gate that the
pool path was already attaching.

Release note: None
Epic: none
@shubhamdhama shubhamdhama force-pushed the add-stream-multiplexing branch from b4179ad to cc1edf7 Compare May 13, 2026 09:18
@cockroach-teamcity cockroach-teamcity added the X-perf-gain Microbenchmarks CI: Added if a performance gain is detected label May 13, 2026

@Nukitt Nukitt left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! approving since you're already tracking the do-not-merge items.

Comment thread pkg/rpc/drpc.go
log.Dev.Infof(ctx, "dialing DRPC mux connection to %s", target)
return dialDRPCMux(ctx, target, drpcDialOptions)
}
log.Dev.Infof(ctx, "dialing DRPC non-mux connection to %s", target)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this and the one above this make the logs a lot noisier since we'll be logging every dial. Maybe lets have some verbosity flag if thats possible.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes. I'll remove these. They were better for debugging.

@shubhamdhama shubhamdhama removed the X-perf-gain Microbenchmarks CI: Added if a performance gain is detected label May 14, 2026
@shubhamdhama

shubhamdhama commented May 15, 2026

Copy link
Copy Markdown
Contributor Author

Multiplexing off

> benchdiff --bazel --old af827678d69715b0cb134f4bc71404034632cdfc --count 20 --benchtime 1000x ./pkg/sql/tests --run Sysbench/SQL/3node/olt
p_read_write --memprofile --cpuprofile
old:  af82767 do-not-merge: enable DRPC
new:  2a00e41 do-not-merge: always on DRPC for unit tests
args: benchdiff "--bazel" "--old" "af827678d69715b0cb134f4bc71404034632cdfc" "--count" "20" "--benchtime" "1000x" "./pkg/sql/tests" "--run" "Sysbench/SQL/3node/oltp_read_write" "--memprofile" "--cpuprofile"

building benchmark binaries for af82767: do-not-merge: enable DRPC [bazel=true] 1/1 \
building benchmark binaries for 2a00e41: do-not-merge: always on DRPC for unit tests [bazel=true] 1/1 \
name                                           old time/op    new time/op    delta
Sysbench/SQL/3node/oltp_read_write-24            14.0ms ± 4%    15.0ms ± 3%  +6.60%  (p=0.000 n=20+18)
ParallelSysbench/SQL/3node/oltp_read_write-24     994µs ± 7%    1064µs ± 6%  +7.02%  (p=0.000 n=20+19)

name                                           old errs/op    new errs/op    delta
Sysbench/SQL/3node/oltp_read_write-24              0.00           0.00         ~     (all equal)
ParallelSysbench/SQL/3node/oltp_read_write-24      0.01 ±62%      0.01 ±46%    ~     (p=0.930 n=20+20)

name                                           old alloc/op   new alloc/op   delta
Sysbench/SQL/3node/oltp_read_write-24            1.34MB ± 6%    1.33MB ± 6%    ~     (p=0.779 n=20+20)
ParallelSysbench/SQL/3node/oltp_read_write-24    1.31MB ± 6%    1.36MB ± 6%  +3.45%  (p=0.001 n=20+20)

name                                           old allocs/op  new allocs/op  delta
Sysbench/SQL/3node/oltp_read_write-24             5.79k ± 2%     5.81k ± 2%    ~     (p=0.317 n=20+20)
ParallelSysbench/SQL/3node/oltp_read_write-24     5.70k ± 1%     5.72k ± 1%    ~     (p=0.101 n=19+19)

wrote merged cpu profile to:
  old=benchdiff/af82767/artifacts/profiles/merged/cpu.prof
  new=benchdiff/2a00e41/artifacts/profiles/merged/cpu.prof

wrote merged mem profile to:
  old=benchdiff/af82767/artifacts/profiles/merged/mem.pb.gz
  new=benchdiff/2a00e41/artifacts/profiles/merged/mem.pb.gz

wrote merged mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz profile to:
  old=benchdiff/af82767/artifacts/profiles/merged/mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz
  new=benchdiff/2a00e41/artifacts/profiles/merged/mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz

wrote merged mem_Sysbench_SQL_3node_oltp_read_write.pb.gz profile to:
  old=benchdiff/af82767/artifacts/profiles/merged/mem_Sysbench_SQL_3node_oltp_read_write.pb.gz
  new=benchdiff/2a00e41/artifacts/profiles/merged/mem_Sysbench_SQL_3node_oltp_read_write.pb.gz

@shubhamdhama

shubhamdhama commented May 19, 2026

Copy link
Copy Markdown
Contributor Author

Multiplexing disabled

> benchdiff -b --old (git rev-parse HEAD^) -c 20 -d 1000x ./pkg/sql/tests -r Sysbench/SQL/3node/oltp_read_write
old:  c5041d3 do-not-merge: enable DRPC
new:  4f20069 rpc: wire up DRPC stream multiplexing behind an en
args: benchdiff "-b" "--old" "c5041d30a31086b50a29124d0edbaf1df09e108b" "-c" "20" "-d" "1000x" "./pkg/sql/tests" "-r" "Sysbench/SQL/3node/oltp_read_write"

name                                           old time/op    new time/op    delta
ParallelSysbench/SQL/3node/oltp_read_write-24     852µs ± 6%     840µs ± 5%    ~     (p=0.055 n=20+18)
Sysbench/SQL/3node/oltp_read_write-24            13.1ms ± 4%    13.7ms ± 2%  +4.49%  (p=0.000 n=20+15)

name                                           old errs/op    new errs/op    delta
Sysbench/SQL/3node/oltp_read_write-24              0.00           0.00         ~     (all equal)
ParallelSysbench/SQL/3node/oltp_read_write-24      0.01 ±61%      0.01 ±64%    ~     (p=0.469 n=20+19)

name                                           old alloc/op   new alloc/op   delta
Sysbench/SQL/3node/oltp_read_write-24            1.32MB ± 5%    1.32MB ± 4%    ~     (p=0.358 n=20+20)
ParallelSysbench/SQL/3node/oltp_read_write-24    1.36MB ± 7%    1.37MB ± 7%    ~     (p=0.429 n=20+20)

name                                           old allocs/op  new allocs/op  delta
ParallelSysbench/SQL/3node/oltp_read_write-24     5.94k ± 2%     5.79k ± 3%  -2.60%  (p=0.001 n=18+20)
Sysbench/SQL/3node/oltp_read_write-24             5.72k ± 0%     5.77k ± 2%  +0.81%  (p=0.015 n=19+20)

Multiplexing enabled

> benchdiff -b --old (git rev-parse HEAD^) -c 20 -d 1000x ./pkg/sql/tests -r Sysbench/SQL/3node/oltp_read_write
old:  c5041d3 do-not-merge: enable DRPC
new:  d7188c5 rpc: wire up DRPC stream multiplexing behind an en
args: benchdiff "-b" "--old" "c5041d30a31086b50a29124d0edbaf1df09e108b" "-c" "20" "-d" "1000x" "./pkg/sql/tests" "-r" "Sysbench/SQL/3node/oltp_read_write"

building benchmark binaries for d7188c5: rpc: wire up DRPC stream multiplexing behind an en [bazel=true] 1/1 |
name                                           old time/op    new time/op    delta
Sysbench/SQL/3node/oltp_read_write-24            12.9ms ± 4%    13.3ms ± 3%  +2.57%  (p=0.001 n=20+20)
ParallelSysbench/SQL/3node/oltp_read_write-24     860µs ± 5%     903µs ± 6%  +5.04%  (p=0.000 n=20+20)

name                                           old errs/op    new errs/op    delta
Sysbench/SQL/3node/oltp_read_write-24              0.00           0.00         ~     (all equal)
ParallelSysbench/SQL/3node/oltp_read_write-24      0.00 ±27%      0.01 ±75%    ~     (p=0.082 n=14+20)

name                                           old alloc/op   new alloc/op   delta
Sysbench/SQL/3node/oltp_read_write-24            1.31MB ± 7%    1.30MB ± 6%    ~     (p=0.429 n=20+20)
ParallelSysbench/SQL/3node/oltp_read_write-24    1.36MB ± 6%    1.41MB ± 6%  +3.67%  (p=0.003 n=20+20)

name                                           old allocs/op  new allocs/op  delta
Sysbench/SQL/3node/oltp_read_write-24             5.72k ± 0%     5.75k ± 2%    ~     (p=0.091 n=18+20)
ParallelSysbench/SQL/3node/oltp_read_write-24     5.95k ± 5%     5.92k ± 1%    ~     (p=0.512 n=20+17)

@shubhamdhama

Copy link
Copy Markdown
Contributor Author

Here is my analysis of understanding the (expected) regression (let me know if something isn't clear here),

benchmark results

> benchdiff -b --old (git rev-parse HEAD^) --count 20 --benchtime 1000x ./pkg/sql/tests --run Sysbench/SQL/3node/oltp_read_write --memprofile --cpuprofile
old:  15db668 do-not-merge: enable DRPC
new:  3918e6f rpc: wire up DRPC stream multiplexing behind an en
args: benchdiff "-b" "--old" "15db668172791a7587a11446afd52482b04ed221" "--count" "20" "--benchtime" "1000x" "./pkg/sql/tests" "--run" "Sysbench/SQL/3node/oltp_read_write" "--memprofile" "--cpuprofile"

building benchmark binaries for 15db668: do-not-merge: enable DRPC [bazel=true] 1/1 -
building benchmark binaries for 3918e6f: rpc: wire up DRPC stream multiplexing behind an en [bazel=true] 1/1 \
name                                           old time/op    new time/op    delta
ParallelSysbench/SQL/3node/oltp_read_write-24     944µs ± 3%     967µs ± 3%  +2.47%  (p=0.000 n=20+19)
Sysbench/SQL/3node/oltp_read_write-24            13.2ms ± 2%    13.7ms ± 3%  +3.68%  (p=0.000 n=19+19)

name                                           old errs/op    new errs/op    delta
Sysbench/SQL/3node/oltp_read_write-24              0.00           0.00         ~     (all equal)
ParallelSysbench/SQL/3node/oltp_read_write-24      0.01 ±83%      0.01 ±67%    ~     (p=0.163 n=19+20)

name                                           old alloc/op   new alloc/op   delta
Sysbench/SQL/3node/oltp_read_write-24            1.32MB ± 6%    1.34MB ± 6%    ~     (p=0.369 n=20+20)
ParallelSysbench/SQL/3node/oltp_read_write-24    1.31MB ± 8%    1.37MB ± 6%  +4.92%  (p=0.000 n=20+20)

name                                           old allocs/op  new allocs/op  delta
ParallelSysbench/SQL/3node/oltp_read_write-24     5.73k ± 1%     5.73k ± 1%    ~     (p=0.948 n=19+19)
Sysbench/SQL/3node/oltp_read_write-24             5.76k ± 1%     5.82k ± 2%  +1.13%  (p=0.001 n=18+20)

wrote merged cpu profile to:
  old=benchdiff/15db668/artifacts/profiles/merged/cpu.prof
  new=benchdiff/3918e6f/artifacts/profiles/merged/cpu.prof

wrote merged mem profile to:
  old=benchdiff/15db668/artifacts/profiles/merged/mem.pb.gz
  new=benchdiff/3918e6f/artifacts/profiles/merged/mem.pb.gz

wrote merged mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz profile to:
  old=benchdiff/15db668/artifacts/profiles/merged/mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz
  new=benchdiff/3918e6f/artifacts/profiles/merged/mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz

wrote merged mem_Sysbench_SQL_3node_oltp_read_write.pb.gz profile to:
  old=benchdiff/15db668/artifacts/profiles/merged/mem_Sysbench_SQL_3node_oltp_read_write.pb.gz
  new=benchdiff/3918e6f/artifacts/profiles/merged/mem_Sysbench_SQL_3node_oltp_read_write.pb.gz

TLDR

Turning on DRPC stream multiplexing costs us about 2.5% on the parallel
Sysbench variant and 3.7% on the serial one, along with a 4.9% bump in
alloc/op on the parallel variant. The cost is split across the two transport
paths that multiplexing reworked, the send side and the receive side.

On send, the old path flushed inline: Stream.MsgSend to rawFlushLocked to
Writer.Flush wrote the bytes straight to the transport on the caller's own
goroutine. With multiplexing, MsgSend frames the message, appends it to a
shared MuxWriter buffer, and signals a dedicated MuxWriter.run goroutine
that drains the buffer to the transport. Every message now pays a cross-goroutine
handoff (a cond signal plus a wakeup) instead of writing inline.

On receive, the old path assembled packets in the reader and handed them up
directly. With multiplexing, the reader parses frames and copies each message
into a per-stream ring buffer (recvQueue), which the application goroutine
drains in MsgRecv. Every message now pays a copy into the ring plus a
producer/consumer handoff.

Most of the transport CPU shifted rather than being added: inline-flush work
became MuxWriter.run work on send, and packet assembly became frame-read work
on receive, each roughly a wash in aggregate. The net regression is the two
genuinely new costs, the per-message copy into the receive ring (the memmove
and the allocation) and the extra goroutine handoff on send (the futex).

Analysis

Here is the normalized CPU diff, so we get a fair comparison between old and
new:

OLD=benchdiff/15db668/artifacts/profiles/merged
NEW=benchdiff/3918e6f/artifacts/profiles/merged
go tool pprof -top -flat -nodecount=40 -normalize -diff_base=$OLD/cpu.prof $NEW/cpu.prof

The functions that grow sit on the new send and receive paths:

  +143.5s  drpcwire.(*MuxWriter).run         (cum)    send: dedicated drain goroutine
   +78.8s  drpcwire.(*Reader).ReadFrame      (cum)    receive: frame reader
   +16.3s  drpcstream.(*ringBuffer).Enqueue  (cum)    receive: copy into per-stream ring
   11.06s  runtime.memmove                   (flat)   message copies (receive ring + assembly)
    7.15s  runtime.futex                     (flat)   goroutine wakeups for the new handoffs

The functions that shrink are the old single-stream paths the new code replaced,
inline send and packet assembly:

  -130.7s  drpcstream.(*Stream).MsgSend         (cum)   old: inline send
  -116.5s  drpcwire.(*Writer).Flush             (cum)   old: inline flush
  -115.7s  drpcstream.(*Stream).rawFlushLocked  (cum)
   -91.9s  drpcwire.(*Reader).ReadPacketUsing   (cum)   old: packet assembly

Lined up against each other, most of the transport CPU is the same work
relocated: send moved from inline flushing into MuxWriter.run, receive moved
from packet assembly into frame reading. What is left over, and what the CPU
regression actually is, are the two new costs: the memmove from copying each
message into the receive ring, and the futex from the extra goroutine handoff
the writer adds on send.

For memory, we look at the allocation diff on the parallel variant:

go tool pprof -top -flat -nodecount=25 -normalize -sample_index=alloc_space \
  -diff_base=$OLD/mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz \
     $NEW/mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz

The result is even more concentrated. A single function dominates:

   1.25GB  4.37%  storj.io/drpc/drpcstream.(*ringBuffer).Enqueue

Everything else in the diff is negative or noise. The ring buffer allocates on
each enqueue, and that one site explains essentially all of the 4.9% alloc/op
increase. Switching to -sample_index=inuse_space shows the same site at
+564MB, 22% of the live heap in the diff, so this is not just short-lived churn:
a meaningful share of what the ring buffer allocates is still held live at the
snapshot.

The reason is the ring's shape. Each stream owns its own 256-slot ring, and the
write index advances on every enqueue (drpcstream/ring_buffer.go), so within
the first 256 messages every slot acquires its own message-sized backing array
and keeps it for the life of the stream, even though the consumer keeps up and
only a slot or two is ever occupied at once. Resident memory therefore scales
with streams times 256 times message size, not with the queue depth actually
used, which is why a modest in-flight working set shows up as a large retained
footprint.

The clearest single picture is the differential flame graph, where red marks
time spent more in new and green marks time spent less:

go tool pprof -http=:8081 -normalize -diff_base=$OLD/cpu.prof $NEW/cpu.prof

In flame graph view the MuxWriter, ringBuffer.Enqueue, and memmove frames
light up red, while the old Flush and rawFlushLocked frames are green. That
matches the numbers above: the regression is the cost of the new multiplexed
paths, a buffered single-writer on send and a per-stream ring with a copy on
receive, replacing the old inline send and packet assembly.

@shubhamdhama shubhamdhama closed this by deleting the head repository Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge bors won't merge a PR with this label.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants