rpc: add DRPC stream multiplexing behind an env gate by shubhamdhama · Pull Request #170248 · cockroachdb/cockroach

shubhamdhama · 2026-05-13T08:41:15Z

Bumps DRPC to pick up stream multiplexing from cockroachdb/drpc#58,
where a single transport connection can carry multiple concurrent
streams.

The multiplexing dial path is guarded by
COCKROACH_EXPERIMENTAL_DRPC_MUX_ENABLED, defaulted to false. With the
gate off, DialDRPC keeps using drpcpool, which checks out a connection
per active stream and dials a new one when none is idle. The upgraded
multiplexing-capable library is therefore exercised on every connection,
but the multiplexing capability itself stays unused: every stream still
ends up on its own underlying connection, matching the prior behavior.

This keeps the initial scope small. The dial-mux path is in place so
it can be turned on for validation, with a follow-up to flip the
default once we are confident.

Both paths now share the same dial-option setup, so the mux path also
picks up the client metrics and the request-recording gate that the
pool path was already attaching.

Release note: None
Epic: none

Adds a small helper that bumps the DRPC replace directive in go.mod to a specific commit. It takes a GitHub commit URL (cockroachdb or shubhamdhama fork), resolves the pseudo-version with `go list`, rewrites the replace line, runs `go mod tidy`, and regenerates the Bazel files via `./dev generate bazel --mirror`. Bumping DRPC was a multi-step ritual; the script removes the chance of forgetting one of the steps. Release note: None Epic: none

trunk-io · 2026-05-13T08:41:20Z

Merging to master in this repository is managed by Trunk.

To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

cockroach-teamcity · 2026-05-13T08:41:32Z

This change is

shubhamdhama · 2026-05-13T08:42:12Z

🚨 Remove do-not-merge label checklist

remove do-not-merge commits
remove personal DRPC fork SHA with cockroach fork SHA

Bumps DRPC to pick up stream multiplexing from cockroachdb/drpc#58, where a single transport connection can carry multiple concurrent streams. The multiplexing dial path is guarded by COCKROACH_EXPERIMENTAL_DRPC_MUX_ENABLED, defaulted to false. With the gate off, DialDRPC keeps using drpcpool, which checks out a connection per active stream and dials a new one when none is idle. The upgraded multiplexing-capable library is therefore exercised on every connection, but the multiplexing capability itself stays unused: every stream still ends up on its own underlying connection, matching the prior behavior. This keeps the initial scope small. The dial-mux path is in place so it can be turned on for validation, with a follow-up to flip the default once we are confident. Both paths now share the same dial-option setup, so the mux path also picks up the client metrics and the request-recording gate that the pool path was already attaching. Release note: None Epic: none

Nukitt

LGTM! approving since you're already tracking the do-not-merge items.

Nukitt · 2026-05-13T11:32:53Z

+			log.Dev.Infof(ctx, "dialing DRPC mux connection to %s", target)
+			return dialDRPCMux(ctx, target, drpcDialOptions)
+		}
+		log.Dev.Infof(ctx, "dialing DRPC non-mux connection to %s", target)


Won't this and the one above this make the logs a lot noisier since we'll be logging every dial. Maybe lets have some verbosity flag if thats possible.

Oh yes. I'll remove these. They were better for debugging.

shubhamdhama · 2026-05-15T12:13:00Z

Multiplexing off

> benchdiff --bazel --old af827678d69715b0cb134f4bc71404034632cdfc --count 20 --benchtime 1000x ./pkg/sql/tests --run Sysbench/SQL/3node/olt
p_read_write --memprofile --cpuprofile
old:  af82767 do-not-merge: enable DRPC
new:  2a00e41 do-not-merge: always on DRPC for unit tests
args: benchdiff "--bazel" "--old" "af827678d69715b0cb134f4bc71404034632cdfc" "--count" "20" "--benchtime" "1000x" "./pkg/sql/tests" "--run" "Sysbench/SQL/3node/oltp_read_write" "--memprofile" "--cpuprofile"

building benchmark binaries for af82767: do-not-merge: enable DRPC [bazel=true] 1/1 \
building benchmark binaries for 2a00e41: do-not-merge: always on DRPC for unit tests [bazel=true] 1/1 \
name                                           old time/op    new time/op    delta
Sysbench/SQL/3node/oltp_read_write-24            14.0ms ± 4%    15.0ms ± 3%  +6.60%  (p=0.000 n=20+18)
ParallelSysbench/SQL/3node/oltp_read_write-24     994µs ± 7%    1064µs ± 6%  +7.02%  (p=0.000 n=20+19)

name                                           old errs/op    new errs/op    delta
Sysbench/SQL/3node/oltp_read_write-24              0.00           0.00         ~     (all equal)
ParallelSysbench/SQL/3node/oltp_read_write-24      0.01 ±62%      0.01 ±46%    ~     (p=0.930 n=20+20)

name                                           old alloc/op   new alloc/op   delta
Sysbench/SQL/3node/oltp_read_write-24            1.34MB ± 6%    1.33MB ± 6%    ~     (p=0.779 n=20+20)
ParallelSysbench/SQL/3node/oltp_read_write-24    1.31MB ± 6%    1.36MB ± 6%  +3.45%  (p=0.001 n=20+20)

name                                           old allocs/op  new allocs/op  delta
Sysbench/SQL/3node/oltp_read_write-24             5.79k ± 2%     5.81k ± 2%    ~     (p=0.317 n=20+20)
ParallelSysbench/SQL/3node/oltp_read_write-24     5.70k ± 1%     5.72k ± 1%    ~     (p=0.101 n=19+19)

wrote merged cpu profile to:
  old=benchdiff/af82767/artifacts/profiles/merged/cpu.prof
  new=benchdiff/2a00e41/artifacts/profiles/merged/cpu.prof

wrote merged mem profile to:
  old=benchdiff/af82767/artifacts/profiles/merged/mem.pb.gz
  new=benchdiff/2a00e41/artifacts/profiles/merged/mem.pb.gz

wrote merged mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz profile to:
  old=benchdiff/af82767/artifacts/profiles/merged/mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz
  new=benchdiff/2a00e41/artifacts/profiles/merged/mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz

wrote merged mem_Sysbench_SQL_3node_oltp_read_write.pb.gz profile to:
  old=benchdiff/af82767/artifacts/profiles/merged/mem_Sysbench_SQL_3node_oltp_read_write.pb.gz
  new=benchdiff/2a00e41/artifacts/profiles/merged/mem_Sysbench_SQL_3node_oltp_read_write.pb.gz

shubhamdhama · 2026-05-19T07:17:58Z

Multiplexing disabled

> benchdiff -b --old (git rev-parse HEAD^) -c 20 -d 1000x ./pkg/sql/tests -r Sysbench/SQL/3node/oltp_read_write
old:  c5041d3 do-not-merge: enable DRPC
new:  4f20069 rpc: wire up DRPC stream multiplexing behind an en
args: benchdiff "-b" "--old" "c5041d30a31086b50a29124d0edbaf1df09e108b" "-c" "20" "-d" "1000x" "./pkg/sql/tests" "-r" "Sysbench/SQL/3node/oltp_read_write"

name                                           old time/op    new time/op    delta
ParallelSysbench/SQL/3node/oltp_read_write-24     852µs ± 6%     840µs ± 5%    ~     (p=0.055 n=20+18)
Sysbench/SQL/3node/oltp_read_write-24            13.1ms ± 4%    13.7ms ± 2%  +4.49%  (p=0.000 n=20+15)

name                                           old errs/op    new errs/op    delta
Sysbench/SQL/3node/oltp_read_write-24              0.00           0.00         ~     (all equal)
ParallelSysbench/SQL/3node/oltp_read_write-24      0.01 ±61%      0.01 ±64%    ~     (p=0.469 n=20+19)

name                                           old alloc/op   new alloc/op   delta
Sysbench/SQL/3node/oltp_read_write-24            1.32MB ± 5%    1.32MB ± 4%    ~     (p=0.358 n=20+20)
ParallelSysbench/SQL/3node/oltp_read_write-24    1.36MB ± 7%    1.37MB ± 7%    ~     (p=0.429 n=20+20)

name                                           old allocs/op  new allocs/op  delta
ParallelSysbench/SQL/3node/oltp_read_write-24     5.94k ± 2%     5.79k ± 3%  -2.60%  (p=0.001 n=18+20)
Sysbench/SQL/3node/oltp_read_write-24             5.72k ± 0%     5.77k ± 2%  +0.81%  (p=0.015 n=19+20)

Multiplexing enabled

> benchdiff -b --old (git rev-parse HEAD^) -c 20 -d 1000x ./pkg/sql/tests -r Sysbench/SQL/3node/oltp_read_write
old:  c5041d3 do-not-merge: enable DRPC
new:  d7188c5 rpc: wire up DRPC stream multiplexing behind an en
args: benchdiff "-b" "--old" "c5041d30a31086b50a29124d0edbaf1df09e108b" "-c" "20" "-d" "1000x" "./pkg/sql/tests" "-r" "Sysbench/SQL/3node/oltp_read_write"

building benchmark binaries for d7188c5: rpc: wire up DRPC stream multiplexing behind an en [bazel=true] 1/1 |
name                                           old time/op    new time/op    delta
Sysbench/SQL/3node/oltp_read_write-24            12.9ms ± 4%    13.3ms ± 3%  +2.57%  (p=0.001 n=20+20)
ParallelSysbench/SQL/3node/oltp_read_write-24     860µs ± 5%     903µs ± 6%  +5.04%  (p=0.000 n=20+20)

name                                           old errs/op    new errs/op    delta
Sysbench/SQL/3node/oltp_read_write-24              0.00           0.00         ~     (all equal)
ParallelSysbench/SQL/3node/oltp_read_write-24      0.00 ±27%      0.01 ±75%    ~     (p=0.082 n=14+20)

name                                           old alloc/op   new alloc/op   delta
Sysbench/SQL/3node/oltp_read_write-24            1.31MB ± 7%    1.30MB ± 6%    ~     (p=0.429 n=20+20)
ParallelSysbench/SQL/3node/oltp_read_write-24    1.36MB ± 6%    1.41MB ± 6%  +3.67%  (p=0.003 n=20+20)

name                                           old allocs/op  new allocs/op  delta
Sysbench/SQL/3node/oltp_read_write-24             5.72k ± 0%     5.75k ± 2%    ~     (p=0.091 n=18+20)
ParallelSysbench/SQL/3node/oltp_read_write-24     5.95k ± 5%     5.92k ± 1%    ~     (p=0.512 n=20+17)

shubhamdhama · 2026-06-01T07:16:47Z

Here is my analysis of understanding the (expected) regression (let me know if something isn't clear here),

benchmark results

> benchdiff -b --old (git rev-parse HEAD^) --count 20 --benchtime 1000x ./pkg/sql/tests --run Sysbench/SQL/3node/oltp_read_write --memprofile --cpuprofile
old:  15db668 do-not-merge: enable DRPC
new:  3918e6f rpc: wire up DRPC stream multiplexing behind an en
args: benchdiff "-b" "--old" "15db668172791a7587a11446afd52482b04ed221" "--count" "20" "--benchtime" "1000x" "./pkg/sql/tests" "--run" "Sysbench/SQL/3node/oltp_read_write" "--memprofile" "--cpuprofile"

building benchmark binaries for 15db668: do-not-merge: enable DRPC [bazel=true] 1/1 -
building benchmark binaries for 3918e6f: rpc: wire up DRPC stream multiplexing behind an en [bazel=true] 1/1 \
name                                           old time/op    new time/op    delta
ParallelSysbench/SQL/3node/oltp_read_write-24     944µs ± 3%     967µs ± 3%  +2.47%  (p=0.000 n=20+19)
Sysbench/SQL/3node/oltp_read_write-24            13.2ms ± 2%    13.7ms ± 3%  +3.68%  (p=0.000 n=19+19)

name                                           old errs/op    new errs/op    delta
Sysbench/SQL/3node/oltp_read_write-24              0.00           0.00         ~     (all equal)
ParallelSysbench/SQL/3node/oltp_read_write-24      0.01 ±83%      0.01 ±67%    ~     (p=0.163 n=19+20)

name                                           old alloc/op   new alloc/op   delta
Sysbench/SQL/3node/oltp_read_write-24            1.32MB ± 6%    1.34MB ± 6%    ~     (p=0.369 n=20+20)
ParallelSysbench/SQL/3node/oltp_read_write-24    1.31MB ± 8%    1.37MB ± 6%  +4.92%  (p=0.000 n=20+20)

name                                           old allocs/op  new allocs/op  delta
ParallelSysbench/SQL/3node/oltp_read_write-24     5.73k ± 1%     5.73k ± 1%    ~     (p=0.948 n=19+19)
Sysbench/SQL/3node/oltp_read_write-24             5.76k ± 1%     5.82k ± 2%  +1.13%  (p=0.001 n=18+20)

wrote merged cpu profile to:
  old=benchdiff/15db668/artifacts/profiles/merged/cpu.prof
  new=benchdiff/3918e6f/artifacts/profiles/merged/cpu.prof

wrote merged mem profile to:
  old=benchdiff/15db668/artifacts/profiles/merged/mem.pb.gz
  new=benchdiff/3918e6f/artifacts/profiles/merged/mem.pb.gz

wrote merged mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz profile to:
  old=benchdiff/15db668/artifacts/profiles/merged/mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz
  new=benchdiff/3918e6f/artifacts/profiles/merged/mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz

wrote merged mem_Sysbench_SQL_3node_oltp_read_write.pb.gz profile to:
  old=benchdiff/15db668/artifacts/profiles/merged/mem_Sysbench_SQL_3node_oltp_read_write.pb.gz
  new=benchdiff/3918e6f/artifacts/profiles/merged/mem_Sysbench_SQL_3node_oltp_read_write.pb.gz

TLDR

Turning on DRPC stream multiplexing costs us about 2.5% on the parallel
Sysbench variant and 3.7% on the serial one, along with a 4.9% bump in
alloc/op on the parallel variant. The cost is split across the two transport
paths that multiplexing reworked, the send side and the receive side.

On send, the old path flushed inline: Stream.MsgSend to rawFlushLocked to
Writer.Flush wrote the bytes straight to the transport on the caller's own
goroutine. With multiplexing, MsgSend frames the message, appends it to a
shared MuxWriter buffer, and signals a dedicated MuxWriter.run goroutine
that drains the buffer to the transport. Every message now pays a cross-goroutine
handoff (a cond signal plus a wakeup) instead of writing inline.

On receive, the old path assembled packets in the reader and handed them up
directly. With multiplexing, the reader parses frames and copies each message
into a per-stream ring buffer (recvQueue), which the application goroutine
drains in MsgRecv. Every message now pays a copy into the ring plus a
producer/consumer handoff.

Most of the transport CPU shifted rather than being added: inline-flush work
became MuxWriter.run work on send, and packet assembly became frame-read work
on receive, each roughly a wash in aggregate. The net regression is the two
genuinely new costs, the per-message copy into the receive ring (the memmove
and the allocation) and the extra goroutine handoff on send (the futex).

Analysis

Here is the normalized CPU diff, so we get a fair comparison between old and
new:

OLD=benchdiff/15db668/artifacts/profiles/merged
NEW=benchdiff/3918e6f/artifacts/profiles/merged
go tool pprof -top -flat -nodecount=40 -normalize -diff_base=$OLD/cpu.prof $NEW/cpu.prof

The functions that grow sit on the new send and receive paths:

  +143.5s  drpcwire.(*MuxWriter).run         (cum)    send: dedicated drain goroutine
   +78.8s  drpcwire.(*Reader).ReadFrame      (cum)    receive: frame reader
   +16.3s  drpcstream.(*ringBuffer).Enqueue  (cum)    receive: copy into per-stream ring
   11.06s  runtime.memmove                   (flat)   message copies (receive ring + assembly)
    7.15s  runtime.futex                     (flat)   goroutine wakeups for the new handoffs

The functions that shrink are the old single-stream paths the new code replaced,
inline send and packet assembly:

  -130.7s  drpcstream.(*Stream).MsgSend         (cum)   old: inline send
  -116.5s  drpcwire.(*Writer).Flush             (cum)   old: inline flush
  -115.7s  drpcstream.(*Stream).rawFlushLocked  (cum)
   -91.9s  drpcwire.(*Reader).ReadPacketUsing   (cum)   old: packet assembly

Lined up against each other, most of the transport CPU is the same work
relocated: send moved from inline flushing into MuxWriter.run, receive moved
from packet assembly into frame reading. What is left over, and what the CPU
regression actually is, are the two new costs: the memmove from copying each
message into the receive ring, and the futex from the extra goroutine handoff
the writer adds on send.

For memory, we look at the allocation diff on the parallel variant:

go tool pprof -top -flat -nodecount=25 -normalize -sample_index=alloc_space \
  -diff_base=$OLD/mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz \
     $NEW/mem_ParallelSysbench_SQL_3node_oltp_read_write.pb.gz

The result is even more concentrated. A single function dominates:

   1.25GB  4.37%  storj.io/drpc/drpcstream.(*ringBuffer).Enqueue

Everything else in the diff is negative or noise. The ring buffer allocates on
each enqueue, and that one site explains essentially all of the 4.9% alloc/op
increase. Switching to -sample_index=inuse_space shows the same site at
+564MB, 22% of the live heap in the diff, so this is not just short-lived churn:
a meaningful share of what the ring buffer allocates is still held live at the
snapshot.

The reason is the ring's shape. Each stream owns its own 256-slot ring, and the
write index advances on every enqueue (drpcstream/ring_buffer.go), so within
the first 256 messages every slot acquires its own message-sized backing array
and keeps it for the life of the stream, even though the consumer keeps up and
only a slot or two is ever occupied at once. Resident memory therefore scales
with streams times 256 times message size, not with the queue depth actually
used, which is why a modest in-flight working set shows up as a large retained
footprint.

The clearest single picture is the differential flame graph, where red marks
time spent more in new and green marks time spent less:

go tool pprof -http=:8081 -normalize -diff_base=$OLD/cpu.prof $NEW/cpu.prof

In flame graph view the MuxWriter, ringBuffer.Enqueue, and memmove frames
light up red, while the old Flush and rawFlushLocked frames are green. That
matches the numbers above: the regression is the cost of the new multiplexed
paths, a buffered single-writer on send and a per-stream ring with a copy on
receive, replacing the old inline send and packet assembly.

shubhamdhama added 2 commits May 13, 2026 07:42

do-not-merge: enable DRPC

18c8ff9

shubhamdhama requested a review from a team May 13, 2026 08:41

shubhamdhama requested review from a team as code owners May 13, 2026 08:41

shubhamdhama added the do-not-merge bors won't merge a PR with this label. label May 13, 2026

shubhamdhama added 2 commits May 13, 2026 09:18

do-not-merge: always on DRPC for unit tests

cc1edf7

shubhamdhama force-pushed the add-stream-multiplexing branch from b4179ad to cc1edf7 Compare May 13, 2026 09:18

cockroach-teamcity added the X-perf-gain Microbenchmarks CI: Added if a performance gain is detected label May 13, 2026

Nukitt approved these changes May 13, 2026

View reviewed changes

shubhamdhama mentioned this pull request May 14, 2026

rpc: adopt DRPC multiplexing infrastructure (one stream per connection) #170327

Closed

shubhamdhama removed the X-perf-gain Microbenchmarks CI: Added if a performance gain is detected label May 14, 2026

kev-cao mentioned this pull request May 19, 2026

backup: re-enable metamorphic DRPC on unit tests #170570

Closed

This was referenced May 26, 2026

RCA for backup: TestBackupRestoreSystemTables failed #169700

Closed

rpc: investigate and re-enable DRPC for flaky backup tests #169538

Open

shubhamdhama closed this by deleting the head repository Jun 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rpc: add DRPC stream multiplexing behind an env gate#170248

rpc: add DRPC stream multiplexing behind an env gate#170248
shubhamdhama wants to merge 4 commits into
cockroachdb:masterfrom
shubhamdhama:add-stream-multiplexing

shubhamdhama commented May 13, 2026

Uh oh!

trunk-io Bot commented May 13, 2026

Uh oh!

cockroach-teamcity commented May 13, 2026

Uh oh!

shubhamdhama commented May 13, 2026 •

edited

Loading

Uh oh!

Nukitt left a comment

Uh oh!

Nukitt May 13, 2026

Uh oh!

shubhamdhama May 13, 2026

Uh oh!

shubhamdhama commented May 15, 2026 •

edited

Loading

Uh oh!

shubhamdhama commented May 19, 2026 •

edited

Loading

Uh oh!

shubhamdhama commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

shubhamdhama commented May 13, 2026

Uh oh!

trunk-io Bot commented May 13, 2026

Uh oh!

cockroach-teamcity commented May 13, 2026

Uh oh!

shubhamdhama commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Nukitt left a comment

Choose a reason for hiding this comment

Uh oh!

Nukitt May 13, 2026

Choose a reason for hiding this comment

Uh oh!

shubhamdhama May 13, 2026

Choose a reason for hiding this comment

Uh oh!

shubhamdhama commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shubhamdhama commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shubhamdhama commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

shubhamdhama commented May 13, 2026 •

edited

Loading

shubhamdhama commented May 15, 2026 •

edited

Loading

shubhamdhama commented May 19, 2026 •

edited

Loading