Skip to content

Avoid Arrow IPC Copies#10044

Open
Rich-T-kid wants to merge 14 commits into
apache:mainfrom
Rich-T-kid:rich-T-kid/optimize-arrow-ipc-copies
Open

Avoid Arrow IPC Copies#10044
Rich-T-kid wants to merge 14 commits into
apache:mainfrom
Rich-T-kid:rich-T-kid/optimize-arrow-ipc-copies

Conversation

@Rich-T-kid
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

Compression is the most compute and memory intensive part of the arrow-ipc encoding pipeline. It runs per buffer, not per record batch. For a Flight stream of 10 batches with 5 primitive arrays each, that is 100 compression calls minimum, more for string and struct arrays. Each of those calls produced an owned compressed Vec that was then copied a second
time into a flat arrow_data accumulator before being written to the output. For the uncompressed path the situation was the same: Arc-backed buffer slices that required no compression were still copied into that accumulator unnecessarily.

Separately, the original write_message() function flushed after every dictionary and every record batch, causing repeated small OS write calls per batch.
The goal was to eliminate both problems: stop copying buffers that do not need to be copied, and stop flushing on every message.

What changes are included in this PR?

  • Introduced EncodedBuffer, an enum that wraps either a raw Arc-backed Buffer for the uncompressed path or an owned Vec for the compressed path, so both can be held in a uniform collection without an extra copy into a flat accumulator
  • Changed write_array_data to push EncodedBuffer segments instead of copying bytes into arrow_data
  • Added write_batch_direct on IpcDataGenerator which writes the FlatBuffer metadata header first, then streams each EncodedBuffer segment directly to the writer with per-buffer alignment padding, never assembling an intermediate flat Vec for the body
  • FileWriter and StreamWriter both now call write_batch_direct(), eliminating the flush-per-message behavior and the intermediate copy on the hot path

Are these changes tested?

These changes are intended to be completely seamless. I didn't write new unit test for the code as nothing externally changed. all test still pass

benchmarks

[main -> cargo bench --bench ipc_writer -- "StreamWriter/write_10$" --sample-size 100]
[my branch -> cargo bench --bench ipc_writer -- "StreamWriter/write_10$" --sample-size 100 ]
Image 6-1-26 at 3 19 PM

[main -> cargo bench --bench ipc_writer -- --sample-size 1000]
[my branch -> cargo bench --bench ipc_writer -- --sample-size 1000]
Image 6-1-26 at 3 20 PM

Are there any user-facing changes?

no

@github-actions github-actions Bot added the arrow Changes to the arrow crate label Jun 1, 2026
Comment thread arrow-ipc/src/writer.rs Outdated
Copy link
Copy Markdown
Contributor

@gabotechs gabotechs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking pretty good. Good job! left some comments mainly directed towards exploring more reuse and bringing a bit more clarity to this file, let me know if you have other ideas.

Comment thread arrow-ipc/src/writer.rs
Comment thread arrow-ipc/src/writer.rs Outdated
Comment thread arrow-ipc/src/writer.rs Outdated
}

let (encoded_dictionaries, encoded_message) = self.data_gen.encode(
let (dict_sizes, (meta, data)) = self.data_gen.write_batch_direct(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two last (meta, data) fields returned by write_batch_direct are named (aligned_size, body_len) in that function.

Is this correct? not sure if this is just a naming thing, but it's hard to know if this is correct given the different naming. Is meta == aligned_size and data == body_len? they sound like completely different things.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the struct & variable names to hopefully make this clearer.

Comment thread arrow-ipc/src/writer.rs Outdated
/// each buffer is compressed into a per-buffer scratch `Vec<u8>` and written from
/// there, eliminating the extra copy that `write_buffer` -> `arrow_data` ->
/// `write_body_buffers` would otherwise incur.
fn write_batch_direct<W: Write>(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see most contents of this function are essentially copy-pastes from record_batch_to_bytes, duplication seems too much here. Is there any chance to:

  • Completely replace record_batch_to_bytes and keep just a single function for writing batches in IPC format
  • Factoring out some ergonomic helpers that could be reused in both functions?

Also, it seems like the .encode() method and the new .write_batch_direct() are both doing the same thing with slightly different ergonomics. Do you see any opportunity to collapse them into just 1 method?

This file is overall pretty bloated with complex logic and a relatively arbitrary separation of concerns between methods, the more we can do for debloating it the better it will be for future maintainers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. the main issue is that FileWriter needs metadata while both StreamWriter and arrow-flight do not. Its better to not compute metadata the caller will not use but the slowdown should be negligible.

@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

Rich-T-kid commented Jun 3, 2026

Benchmark results from #10031

ran cargo bench --bench flight encode -- --sample-size 100
Image 6-3-26 at 1 49 PM
Image 6-3-26 at 1 49 PM
Image 6-3-26 at 1 49 PM (1)

@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

going to look into why fixed/9182x1 regressed. Might just be noise

@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from fc1dbb8 to ecb37f2 Compare June 3, 2026 18:46
@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

I think its worth mentioning that no dictionary optimizations were made in the PR, could make to make that a follow up ticket.

@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

Ran benchmarks again and the results still look good. Test are passing for arrow-flight & arrow-ipc.
@alamb could you run the benchmarks for this PR on the CI bot when you get a chance? thank you!

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Jun 4, 2026

run benchmark flight

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4621154654-431-g2mb7 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (04e7992) to 97f4b14 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.02    274.7±1.07µs   915.2 MB/sec    1.00    270.3±1.55µs   930.2 MB/sec
encode/dict/65536x8           1.30      5.5±0.04ms   365.9 MB/sec    1.00      4.2±0.06ms   474.4 MB/sec
encode/dict/8192x1            1.01     35.3±0.04µs   924.9 MB/sec    1.00     35.1±0.05µs   930.1 MB/sec
encode/dict/8192x8            1.10    315.3±1.66µs   828.7 MB/sec    1.00    286.1±2.01µs   913.3 MB/sec
encode/fixed/65536x1          1.00     10.0±0.02µs    48.8 GB/sec    1.00     10.0±0.04µs    49.1 GB/sec
encode/fixed/65536x8          1.01   1092.1±2.65µs     3.6 GB/sec    1.00   1076.9±5.83µs     3.6 GB/sec
encode/fixed/8192x1           1.03      3.2±0.01µs    19.3 GB/sec    1.00      3.1±0.02µs    19.8 GB/sec
encode/fixed/8192x8           1.07     17.5±0.04µs    27.9 GB/sec    1.00     16.4±0.07µs    29.8 GB/sec
encode/nested/65536x1         1.00     29.0±0.24µs    42.1 GB/sec    1.04     30.2±0.30µs    40.4 GB/sec
encode/nested/65536x8         1.12      2.5±0.08ms     3.9 GB/sec    1.00      2.2±0.13ms     4.4 GB/sec
encode/nested/8192x1          1.00      5.8±0.01µs    26.5 GB/sec    1.00      5.8±0.01µs    26.6 GB/sec
encode/nested/8192x8          1.01     46.3±0.10µs    26.4 GB/sec    1.00     45.9±0.56µs    26.6 GB/sec
encode/variable/65536x1       1.04     51.5±0.55µs    42.7 GB/sec    1.00     49.3±0.36µs    44.6 GB/sec
encode/variable/65536x8       1.24      5.7±0.09ms     3.1 GB/sec    1.00      4.6±0.06ms     3.8 GB/sec
encode/variable/8192x1        1.18      7.1±0.01µs    38.8 GB/sec    1.00      6.0±0.01µs    45.8 GB/sec
encode/variable/8192x8        1.32     83.3±1.96µs    26.4 GB/sec    1.00     63.3±0.20µs    34.7 GB/sec
roundtrip/dict/65536x1        1.01  1288.9±47.00µs   195.1 MB/sec    1.00  1279.9±51.20µs   196.4 MB/sec
roundtrip/dict/65536x8        1.00     14.4±0.55ms   139.8 MB/sec    1.01     14.5±0.53ms   138.4 MB/sec
roundtrip/dict/8192x1         1.00    206.6±5.60µs   158.1 MB/sec    1.00    206.0±5.73µs   158.6 MB/sec
roundtrip/dict/8192x8         1.00  1322.8±44.85µs   197.5 MB/sec    1.00  1324.0±44.79µs   197.3 MB/sec
roundtrip/fixed/65536x1       1.02    314.2±4.02µs  1591.6 MB/sec    1.00    307.7±4.21µs  1625.2 MB/sec
roundtrip/fixed/65536x8       1.00      2.2±0.03ms  1856.3 MB/sec    1.15      2.5±0.13ms  1608.9 MB/sec
roundtrip/fixed/8192x1        1.00     89.9±0.94µs   696.0 MB/sec    1.00     89.6±1.04µs   698.6 MB/sec
roundtrip/fixed/8192x8        1.01    333.8±3.84µs  1499.9 MB/sec    1.00    330.7±3.14µs  1514.2 MB/sec
roundtrip/nested/65536x1      1.00   859.6±40.22µs  1454.4 MB/sec    1.00   862.4±44.79µs  1449.6 MB/sec
roundtrip/nested/65536x8      1.01      8.6±0.36ms  1156.8 MB/sec    1.00      8.6±0.36ms  1168.2 MB/sec
roundtrip/nested/8192x1       1.01    159.3±5.82µs   981.9 MB/sec    1.00    157.6±5.86µs   993.0 MB/sec
roundtrip/nested/8192x8       1.02   925.9±41.73µs  1351.8 MB/sec    1.00   910.3±44.09µs  1374.9 MB/sec
roundtrip/variable/65536x1    1.00  1246.2±34.07µs  1805.6 MB/sec    1.54  1922.3±132.51µs  1170.5 MB/sec
roundtrip/variable/65536x8    1.17     16.1±0.50ms  1119.3 MB/sec    1.00     13.7±0.50ms  1313.5 MB/sec
roundtrip/variable/8192x1     1.00    204.0±5.40µs  1379.8 MB/sec    1.00    203.1±5.65µs  1385.4 MB/sec
roundtrip/variable/8192x8     1.00  1211.3±26.93µs  1858.7 MB/sec    1.57  1897.2±120.59µs  1186.7 MB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 345.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 350.2s
CPU sys 74.8s
Peak spill 0 B

branch

Metric Value
Wall time 335.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 335.9s
CPU sys 78.4s
Peak spill 0 B

File an issue against this benchmark runner

@gabotechs
Copy link
Copy Markdown
Contributor

run benchmarks ipc_writer

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4621805130-432-xw4gv 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (04e7992) to 97f4b14 (merge-base) diff
BENCH_NAME=ipc_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench ipc_writer
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                                                 main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                                                 ----                                   ------------------------------------
arrow_ipc_stream_writer/FileWriter/write_10           1.90    185.2±2.17µs        ? ?/sec    1.00     97.3±4.50µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10         1.94    185.1±1.96µs        ? ?/sec    1.00     95.4±4.87µs        ? ?/sec
arrow_ipc_stream_writer/StreamWriter/write_10/zstd    1.01      7.3±0.02ms        ? ?/sec    1.00      7.2±0.03ms        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 30.0s
Peak memory 2.7 GiB
Avg memory 2.6 GiB
CPU user 27.5s
CPU sys 0.6s
Peak spill 0 B

branch

Metric Value
Wall time 30.0s
Peak memory 2.7 GiB
Avg memory 2.6 GiB
CPU user 29.7s
CPU sys 0.1s
Peak spill 0 B

File an issue against this benchmark runner

@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

Rich-T-kid commented Jun 4, 2026

🤔 results look good, Im curious as to why two of the roundtrip benchmarks were slightly slower even thought encode() is faster across the board.

@gabotechs
Copy link
Copy Markdown
Contributor

run benchmark flight

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4622787465-441-9sh6d 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (04e7992) to 97f4b14 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.01    271.4±1.78µs   926.3 MB/sec    1.00    267.9±0.66µs   938.6 MB/sec
encode/dict/65536x8           1.00      5.1±0.12ms   395.2 MB/sec    1.11      5.6±0.21ms   357.0 MB/sec
encode/dict/8192x1            1.04     36.4±0.04µs   896.4 MB/sec    1.00     35.0±0.03µs   932.9 MB/sec
encode/dict/8192x8            1.06    305.3±1.34µs   855.9 MB/sec    1.00    287.6±1.80µs   908.7 MB/sec
encode/fixed/65536x1          1.04     10.3±0.02µs    47.6 GB/sec    1.00      9.9±0.02µs    49.3 GB/sec
encode/fixed/65536x8          1.00   1084.9±6.70µs     3.6 GB/sec    1.01   1096.8±4.18µs     3.6 GB/sec
encode/fixed/8192x1           1.00      3.0±0.01µs    20.3 GB/sec    1.03      3.1±0.01µs    19.7 GB/sec
encode/fixed/8192x8           1.06     17.4±0.02µs    28.0 GB/sec    1.00     16.4±0.06µs    29.8 GB/sec
encode/nested/65536x1         1.29     38.5±0.19µs    31.7 GB/sec    1.00     29.8±0.58µs    41.0 GB/sec
encode/nested/65536x8         1.32      2.7±0.04ms     3.6 GB/sec    1.00      2.1±0.20ms     4.8 GB/sec
encode/nested/8192x1          1.01      5.8±0.01µs    26.2 GB/sec    1.00      5.8±0.02µs    26.4 GB/sec
encode/nested/8192x8          1.04     47.3±0.08µs    25.8 GB/sec    1.00     45.5±0.14µs    26.9 GB/sec
encode/variable/65536x1       1.63     80.5±0.47µs    27.3 GB/sec    1.00     49.3±0.34µs    44.6 GB/sec
encode/variable/65536x8       1.10      5.4±0.19ms     3.2 GB/sec    1.00      4.9±0.11ms     3.6 GB/sec
encode/variable/8192x1        1.81     10.7±0.01µs    25.7 GB/sec    1.00      5.9±0.01µs    46.6 GB/sec
encode/variable/8192x8        1.37     87.5±0.23µs    25.1 GB/sec    1.00     63.9±0.26µs    34.4 GB/sec
roundtrip/dict/65536x1        1.00  1319.9±43.70µs   190.5 MB/sec    1.00  1316.4±41.25µs   191.0 MB/sec
roundtrip/dict/65536x8        1.00     14.4±0.59ms   139.9 MB/sec    1.02     14.6±0.57ms   137.9 MB/sec
roundtrip/dict/8192x1         1.01    213.3±5.87µs   153.1 MB/sec    1.00    211.4±6.11µs   154.5 MB/sec
roundtrip/dict/8192x8         1.01  1346.6±44.16µs   194.0 MB/sec    1.00  1338.9±42.85µs   195.1 MB/sec
roundtrip/fixed/65536x1       1.00    318.9±4.39µs  1568.0 MB/sec    1.00    319.6±4.24µs  1564.8 MB/sec
roundtrip/fixed/65536x8       1.00      2.2±0.03ms  1821.4 MB/sec    1.19      2.6±0.16ms  1532.6 MB/sec
roundtrip/fixed/8192x1        1.00     93.5±1.03µs   669.7 MB/sec    1.01     94.6±1.56µs   661.8 MB/sec
roundtrip/fixed/8192x8        1.01    342.0±6.09µs  1464.3 MB/sec    1.00    338.7±4.37µs  1478.2 MB/sec
roundtrip/nested/65536x1      1.01   885.1±37.56µs  1412.4 MB/sec    1.00   879.6±39.35µs  1421.4 MB/sec
roundtrip/nested/65536x8      1.00     10.0±0.39ms  1002.2 MB/sec    1.02     10.2±0.43ms   979.4 MB/sec
roundtrip/nested/8192x1       1.01    163.0±5.56µs   960.0 MB/sec    1.00    161.1±5.13µs   971.1 MB/sec
roundtrip/nested/8192x8       1.00   927.1±39.57µs  1350.1 MB/sec    1.02   941.8±46.61µs  1329.0 MB/sec
roundtrip/variable/65536x1    1.00  1285.3±51.72µs  1750.7 MB/sec    1.49  1910.5±128.29µs  1177.8 MB/sec
roundtrip/variable/65536x8    1.00     14.8±0.74ms  1217.3 MB/sec    1.01     15.0±0.53ms  1199.3 MB/sec
roundtrip/variable/8192x1     1.00    210.4±5.49µs  1337.6 MB/sec    1.00    209.4±6.81µs  1344.0 MB/sec
roundtrip/variable/8192x8     1.00  1251.7±29.66µs  1798.6 MB/sec    1.53  1914.9±122.75µs  1175.7 MB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 345.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 352.2s
CPU sys 71.6s
Peak spill 0 B

branch

Metric Value
Wall time 340.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 333.1s
CPU sys 84.7s
Peak spill 0 B

File an issue against this benchmark runner

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Jun 4, 2026

🤔 results look good, Im curious as to why two of the roundtrip benchmarks were slightly slower even thought encode() is faster across the board.

Looks like the results are reproducable -- next step would be to profile it to see if you can find the answer

I wonder if we are missing a Vec::with_capacity or Vec::reserve to avoid extra allocations / copies 🤔

@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

Rich-T-kid commented Jun 4, 2026

i'm suspecting the issue has to do with
let (client, server) = tokio::io::duplex(1024 * 1024);
per the docs "The max_buf_size argument is the maximum amount of bytes that can be written to a side before the write returns Poll::Pending."

The two regression cases both involve large variable-length data where the encoded payload can be huge:
roundtrip/variable/8192x8 — 8 columns × 8192 rows
roundtrip/variable/65536x1 — 65536 rows, large values buffer

This also shows up in the regression cases,
roundtrip/variable/8192x8 1.00 1251.7±29.66µs 1798.6 MB/sec 1.53 1914.9±122.75µs 1175.7 MB/sec
roundtrip/variable/65536x1 1.00 1285.3±51.72µs 1750.7 MB/sec 1.49 1910.5±128.29µs 1177.8 MB/sec
throughput falls flat.

taking a look at the other benchmark results this seems consistant,
roundtrip/fixed/65536x8 1.00 2.2±0.03ms 1821.4 MB/sec 1.19 2.6±0.16ms 1532.6 MB/sec throughput shrinks and as such causes more blocking to happen.

Even in the event where this isn't the reason for the slow down I think 1MB is still to small for realistic max throughput.

@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

I wonder if we are missing a Vec::with_capacity or Vec::reserve to avoid extra allocations / copies

I think this is a strong possibility after looking at the profile, from my understanding this is mostly in arrow-flight itself and not arrow-ipc. Since arrow-flight is very dependent on arrow-ipc it make sense to start from the ground up with these optimizations.

alamb pushed a commit that referenced this pull request Jun 4, 2026
# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->
- Closes #10029.

# Rationale for this change
Increase the duplex buffer from 1 MB to 64 MB to eliminate artificial
back-pressure in the roundtrip benchmarks.
See rational in this
[comment](#10044 (comment))
<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

# What changes are included in this PR?
bumps `max_buf_size` to 64**MB**
<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

# Are these changes tested?
n/a
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

If this PR claims a performance improvement, please include evidence
such as benchmark results.
-->

# Are there any user-facing changes?
n/a
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->
@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4625207879-445-bfd2s 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (7acd810) to d7ef673 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.05    281.7±0.67µs   892.5 MB/sec    1.00    269.4±1.07µs   933.2 MB/sec
encode/dict/65536x8           1.17      7.3±0.12ms   274.6 MB/sec    1.00      6.3±0.10ms   320.3 MB/sec
encode/dict/8192x1            1.00     34.9±0.04µs   934.5 MB/sec    1.00     34.8±0.03µs   938.7 MB/sec
encode/dict/8192x8            1.05    298.5±1.72µs   875.4 MB/sec    1.00    284.1±0.62µs   919.7 MB/sec
encode/fixed/65536x1          1.00     10.0±0.02µs    48.9 GB/sec    1.00     10.0±0.02µs    49.0 GB/sec
encode/fixed/65536x8          1.00   1003.8±2.89µs     3.9 GB/sec    1.05   1058.5±2.02µs     3.7 GB/sec
encode/fixed/8192x1           1.01      3.1±0.01µs    19.4 GB/sec    1.00      3.1±0.01µs    19.7 GB/sec
encode/fixed/8192x8           1.02     16.9±0.03µs    28.9 GB/sec    1.00     16.6±0.03µs    29.4 GB/sec
encode/nested/65536x1         1.00     28.1±0.14µs    43.4 GB/sec    1.06     29.7±0.53µs    41.1 GB/sec
encode/nested/65536x8         1.01      3.0±0.03ms     3.3 GB/sec    1.00      2.9±0.02ms     3.3 GB/sec
encode/nested/8192x1          1.00      5.8±0.01µs    26.5 GB/sec    1.00      5.7±0.01µs    26.7 GB/sec
encode/nested/8192x8          1.06     48.7±0.10µs    25.1 GB/sec    1.00     46.0±0.40µs    26.5 GB/sec
encode/variable/65536x1       1.50     73.9±0.38µs    29.7 GB/sec    1.00     49.2±0.16µs    44.6 GB/sec
encode/variable/65536x8       1.21      5.3±0.07ms     3.3 GB/sec    1.00      4.4±0.06ms     4.0 GB/sec
encode/variable/8192x1        1.18      7.0±0.01µs    39.3 GB/sec    1.00      5.9±0.01µs    46.5 GB/sec
encode/variable/8192x8        1.32     83.4±0.15µs    26.4 GB/sec    1.00     63.2±0.21µs    34.8 GB/sec
roundtrip/dict/65536x1        1.00  1279.6±46.49µs   196.5 MB/sec    1.00  1279.9±43.65µs   196.4 MB/sec
roundtrip/dict/65536x8        1.03     15.7±0.52ms   127.7 MB/sec    1.00     15.3±0.49ms   131.1 MB/sec
roundtrip/dict/8192x1         1.00    205.5±5.41µs   159.0 MB/sec    1.00    205.1±5.68µs   159.3 MB/sec
roundtrip/dict/8192x8         1.00  1307.2±44.46µs   199.9 MB/sec    1.01  1317.0±53.52µs   198.4 MB/sec
roundtrip/fixed/65536x1       1.00    303.4±5.86µs  1648.1 MB/sec    1.02    308.6±4.48µs  1620.3 MB/sec
roundtrip/fixed/65536x8       1.00      2.2±0.03ms  1857.5 MB/sec    1.52      3.3±0.19ms  1224.0 MB/sec
roundtrip/fixed/8192x1        1.02     90.2±1.71µs   694.0 MB/sec    1.00     88.4±1.47µs   707.7 MB/sec
roundtrip/fixed/8192x8        1.00    324.8±3.20µs  1541.5 MB/sec    1.00    326.3±3.73µs  1534.4 MB/sec
roundtrip/nested/65536x1      1.00   832.1±42.06µs  1502.4 MB/sec    1.02   844.9±46.87µs  1479.6 MB/sec
roundtrip/nested/65536x8      1.00      9.5±0.62ms  1051.7 MB/sec    1.03      9.8±0.51ms  1023.3 MB/sec
roundtrip/nested/8192x1       1.01    157.5±5.94µs   993.3 MB/sec    1.00    156.1±6.26µs  1002.2 MB/sec
roundtrip/nested/8192x8       1.00   894.7±45.10µs  1399.0 MB/sec    1.00   897.4±45.96µs  1394.7 MB/sec
roundtrip/variable/65536x1    1.00  1201.3±32.52µs  1873.1 MB/sec    1.42  1700.7±124.37µs  1323.1 MB/sec
roundtrip/variable/65536x8    1.08     16.0±0.87ms  1127.4 MB/sec    1.00     14.7±0.57ms  1222.9 MB/sec
roundtrip/variable/8192x1     1.02    204.2±5.80µs  1378.3 MB/sec    1.00    200.9±5.67µs  1400.5 MB/sec
roundtrip/variable/8192x8     1.00  1208.2±29.51µs  1863.4 MB/sec    1.29  1553.8±151.44µs  1449.0 MB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 335.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 340.4s
CPU sys 74.4s
Peak spill 0 B

branch

Metric Value
Wall time 330.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 325.2s
CPU sys 84.2s
Peak spill 0 B

File an issue against this benchmark runner

@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

Rich-T-kid commented Jun 5, 2026

the benchmarks look closer for roundtrip/variable on my local machine than it is when @adriangbot runs them. This leads me to believe it has something to do with the size of hardware like cpu caches.
Image 6-4-26 at 8 42 PM
its hard to verify whats causing improvements/regressions when my local benchmarks are showing different characteristics.
updated the PR, can we run the benchmarks again? If this is still causing a regression I think it makes sense to look into the arrow-flight crate.

@gabotechs
Copy link
Copy Markdown
Contributor

run benchmarks flight

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4628731195-448-n8w99 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (86f4e34) to d7ef673 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.00    268.7±0.46µs   935.5 MB/sec    1.00    269.2±0.76µs   934.0 MB/sec
encode/dict/65536x8           1.00      5.5±0.03ms   362.7 MB/sec    1.53      8.5±0.07ms   237.4 MB/sec
encode/dict/8192x1            1.00     35.3±0.03µs   925.3 MB/sec    1.00     35.2±0.03µs   927.1 MB/sec
encode/dict/8192x8            1.04    298.1±0.58µs   876.7 MB/sec    1.00    285.9±0.78µs   914.1 MB/sec
encode/fixed/65536x1          1.02     10.0±0.02µs    49.0 GB/sec    1.00      9.8±0.01µs    49.8 GB/sec
encode/fixed/65536x8          1.03   1130.5±2.04µs     3.5 GB/sec    1.00   1099.5±2.72µs     3.6 GB/sec
encode/fixed/8192x1           1.00      3.0±0.01µs    20.3 GB/sec    1.03      3.1±0.01µs    19.7 GB/sec
encode/fixed/8192x8           1.02     17.4±0.05µs    28.1 GB/sec    1.00     17.1±0.03µs    28.7 GB/sec
encode/nested/65536x1         1.25     38.4±0.18µs    31.8 GB/sec    1.00     30.7±0.45µs    39.7 GB/sec
encode/nested/65536x8         1.00      2.8±0.01ms     3.5 GB/sec    1.08      3.0±0.01ms     3.2 GB/sec
encode/nested/8192x1          1.00      5.8±0.01µs    26.2 GB/sec    1.00      5.8±0.01µs    26.2 GB/sec
encode/nested/8192x8          1.04     48.1±0.12µs    25.4 GB/sec    1.00     46.2±0.62µs    26.5 GB/sec
encode/variable/65536x1       1.50     73.8±0.24µs    29.8 GB/sec    1.00     49.3±0.21µs    44.6 GB/sec
encode/variable/65536x8       1.04      5.0±0.03ms     3.5 GB/sec    1.00      4.8±0.02ms     3.7 GB/sec
encode/variable/8192x1        1.19      6.9±0.01µs    39.8 GB/sec    1.00      5.8±0.01µs    47.4 GB/sec
encode/variable/8192x8        1.30     82.6±0.18µs    26.6 GB/sec    1.00     63.5±0.17µs    34.6 GB/sec
roundtrip/dict/65536x1        1.01  1292.0±46.36µs   194.6 MB/sec    1.00  1284.5±44.63µs   195.7 MB/sec
roundtrip/dict/65536x8        1.08     15.5±0.55ms   129.4 MB/sec    1.00     14.4±0.48ms   139.3 MB/sec
roundtrip/dict/8192x1         1.00    210.9±5.84µs   154.9 MB/sec    1.00    210.5±5.31µs   155.2 MB/sec
roundtrip/dict/8192x8         1.01  1346.8±45.92µs   194.0 MB/sec    1.00  1336.1±41.61µs   195.6 MB/sec
roundtrip/fixed/65536x1       1.00    315.5±4.56µs  1585.0 MB/sec    1.00    316.1±4.30µs  1581.9 MB/sec
roundtrip/fixed/65536x8       1.00      2.2±0.03ms  1815.6 MB/sec    1.37      3.0±0.22ms  1326.9 MB/sec
roundtrip/fixed/8192x1        1.00     94.9±0.97µs   659.8 MB/sec    1.00     95.0±1.18µs   659.1 MB/sec
roundtrip/fixed/8192x8        1.00    333.9±3.94µs  1499.6 MB/sec    1.00    333.1±3.84µs  1503.1 MB/sec
roundtrip/nested/65536x1      1.01   859.2±44.16µs  1455.0 MB/sec    1.00   848.1±41.92µs  1474.1 MB/sec
roundtrip/nested/65536x8      1.00      9.5±0.61ms  1055.8 MB/sec    1.06     10.0±0.59ms   997.1 MB/sec
roundtrip/nested/8192x1       1.02    163.5±5.51µs   956.8 MB/sec    1.00    160.6±5.81µs   974.2 MB/sec
roundtrip/nested/8192x8       1.00   908.7±42.46µs  1377.3 MB/sec    1.00   906.2±42.33µs  1381.2 MB/sec
roundtrip/variable/65536x1    1.00  1243.2±34.18µs  1810.0 MB/sec    1.52  1888.0±110.52µs  1191.8 MB/sec
roundtrip/variable/65536x8    1.05     16.9±0.65ms  1064.7 MB/sec    1.00     16.1±0.55ms  1117.4 MB/sec
roundtrip/variable/8192x1     1.01    209.6±5.84µs  1342.8 MB/sec    1.00    207.1±6.04µs  1358.7 MB/sec
roundtrip/variable/8192x8     1.00  1240.0±26.08µs  1815.6 MB/sec    1.42  1762.6±112.62µs  1277.3 MB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 340.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 340.0s
CPU sys 76.8s
Peak spill 0 B

branch

Metric Value
Wall time 330.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 319.2s
CPU sys 86.0s
Peak spill 0 B

File an issue against this benchmark runner

@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from 86f4e34 to 7acd810 Compare June 5, 2026 13:55
@alamb alamb changed the title Rich t kid/optimize arrow ipc copies Avoid Arrow IPC Copies Jun 5, 2026
@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

Rich-T-kid commented Jun 5, 2026

I've been working on this all day and any further changes risk scope creep, so I'd like to split the IPC StreamWriter/FileWriter improvements and the arrow-flight work into separate PRs. since benchmarks for those look good & they are unrelated to the arrow-flight work.
Due to async polling it's hard to distinguish what copies happen at the tonic level versus in arrow-flight. I've been profiling this locally with an additional encode_to_send benchmark that measures the full path via do_put [will include in follow up PR] @alamb does that sound like a good idea?
Image 6-5-26 at 4 58 PM
Image 6-5-26 at 4 57 PM (1)
Image 6-5-26 at 4 57 PM
One question: does anyone have insight into why the benchmarks behave differently on the CI workers versus locally? I saw something similar in distributed DataFusion and it came down to thread count, but beyond thread count and CPU cache size I'm not sure what else could explain the difference at this scale.

@github-actions github-actions Bot added the arrow-flight Changes to the arrow-flight crate label Jun 7, 2026
@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

Image 6-7-26 at 4 16 PM results here seem in line with [this](https://github.com//pull/10044#issuecomment-4621838941)

I removed any logic that was touching arrow-flights path. This PR focuses on removing intermediary buffer allocations for the StreamWriter and FIleWriter. Instead of accumulating all buffer bytes into a heap allocation before writing, buffer pointers are collected during encoding and streamed directly to the underlying writer once the FlatBuffer header is built.

@github-actions github-actions Bot removed the arrow-flight Changes to the arrow-flight crate label Jun 7, 2026
Comment thread arrow-ipc/src/writer.rs
@Rich-T-kid
Copy link
Copy Markdown
Contributor Author

The diff looks larger than it is due to estimate_encoded_buffer_count() (which estimates how many slots to pre-allocate for the buffer vector) and the if statements that determine whether write_buffers() or collect_encoded_buffers() is called based on whether a Writer was supplied. This avoids duplicating logic by reusing shared functionality across both code paths.
@gabotechs could you take another look at this when you get a chance 🚀

Comment thread arrow-ipc/src/writer.rs
@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from cbc1d46 to ce6b828 Compare June 8, 2026 00:03
@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/optimize-arrow-ipc-copies branch from ce6b828 to 095d7fd Compare June 8, 2026 00:05
@gabotechs
Copy link
Copy Markdown
Contributor

run benchmarks flight

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4645915589-475-qqxw6 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/optimize-arrow-ipc-copies (a3f9c53) to d7ef673 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot
Copy link
Copy Markdown

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                         main                                   rich-T-kid_optimize-arrow-ipc-copies
-----                         ----                                   ------------------------------------
encode/dict/65536x1           1.00    269.0±1.27µs   934.5 MB/sec    1.05    282.8±1.13µs   889.2 MB/sec
encode/dict/65536x8           1.00      4.0±0.03ms   499.8 MB/sec    1.43      5.7±0.03ms   350.5 MB/sec
encode/dict/8192x1            1.00     35.5±0.43µs   920.9 MB/sec    1.03     36.6±0.04µs   891.5 MB/sec
encode/dict/8192x8            1.01    306.2±4.60µs   853.4 MB/sec    1.00    301.9±2.32µs   865.4 MB/sec
encode/fixed/65536x1          1.00      9.9±0.02µs    49.5 GB/sec    1.03     10.1±0.02µs    48.3 GB/sec
encode/fixed/65536x8          1.03   1121.5±1.42µs     3.5 GB/sec    1.00   1087.5±2.55µs     3.6 GB/sec
encode/fixed/8192x1           1.00      3.2±0.01µs    19.3 GB/sec    1.00      3.2±0.01µs    19.4 GB/sec
encode/fixed/8192x8           1.00     17.6±0.03µs    27.7 GB/sec    1.01     17.8±0.05µs    27.5 GB/sec
encode/nested/65536x1         1.00     37.5±0.19µs    32.6 GB/sec    1.15     42.9±0.36µs    28.4 GB/sec
encode/nested/65536x8         1.15      3.0±0.01ms     3.3 GB/sec    1.00      2.6±0.01ms     3.8 GB/sec
encode/nested/8192x1          1.00      5.8±0.01µs    26.4 GB/sec    1.02      5.9±0.01µs    26.0 GB/sec
encode/nested/8192x8          1.00     46.5±0.08µs    26.3 GB/sec    1.05     49.1±0.13µs    24.9 GB/sec
encode/variable/65536x1       1.00     72.7±0.39µs    30.2 GB/sec    1.12     81.5±0.33µs    27.0 GB/sec
encode/variable/65536x8       1.00      5.0±0.03ms     3.5 GB/sec    1.12      5.6±0.05ms     3.1 GB/sec
encode/variable/8192x1        1.00      6.8±0.02µs    40.5 GB/sec    1.03      7.0±0.01µs    39.1 GB/sec
encode/variable/8192x8        1.00     80.9±0.16µs    27.2 GB/sec    1.02     82.7±0.21µs    26.6 GB/sec
roundtrip/dict/65536x1        1.00  1281.9±43.73µs   196.1 MB/sec    1.01  1300.0±43.37µs   193.4 MB/sec
roundtrip/dict/65536x8        1.02     14.7±0.61ms   137.1 MB/sec    1.00     14.3±0.60ms   140.4 MB/sec
roundtrip/dict/8192x1         1.00    204.0±5.97µs   160.1 MB/sec    1.03    209.5±5.67µs   155.9 MB/sec
roundtrip/dict/8192x8         1.00  1333.1±41.92µs   196.0 MB/sec    1.01  1350.3±43.61µs   193.5 MB/sec
roundtrip/fixed/65536x1       1.00    311.1±3.95µs  1607.4 MB/sec    1.01    315.5±3.73µs  1585.1 MB/sec
roundtrip/fixed/65536x8       1.00      2.1±0.02ms  1873.4 MB/sec    1.01      2.1±0.03ms  1861.8 MB/sec
roundtrip/fixed/8192x1        1.00     91.2±1.40µs   686.4 MB/sec    1.00     91.4±0.86µs   684.5 MB/sec
roundtrip/fixed/8192x8        1.00    328.6±4.67µs  1524.1 MB/sec    1.02    336.4±3.39µs  1488.6 MB/sec
roundtrip/nested/65536x1      1.00   849.6±40.97µs  1471.5 MB/sec    1.01   857.6±41.57µs  1457.7 MB/sec
roundtrip/nested/65536x8      1.00      8.3±0.34ms  1200.9 MB/sec    1.16      9.7±0.36ms  1034.4 MB/sec
roundtrip/nested/8192x1       1.00    158.6±5.50µs   986.4 MB/sec    1.01    159.9±5.60µs   978.3 MB/sec
roundtrip/nested/8192x8       1.00   904.4±45.78µs  1383.9 MB/sec    1.00   904.4±43.74µs  1383.9 MB/sec
roundtrip/variable/65536x1    1.00  1212.7±41.75µs  1855.5 MB/sec    1.01  1224.4±31.76µs  1837.8 MB/sec
roundtrip/variable/65536x8    1.00     16.1±0.50ms  1120.5 MB/sec    1.02     16.3±0.57ms  1101.1 MB/sec
roundtrip/variable/8192x1     1.00    204.7±5.69µs  1374.6 MB/sec    1.02    209.1±5.61µs  1345.9 MB/sec
roundtrip/variable/8192x8     1.00  1211.3±21.91µs  1858.7 MB/sec    1.01  1219.1±27.99µs  1846.7 MB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 335.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 348.5s
CPU sys 65.9s
Peak spill 0 B

branch

Metric Value
Wall time 340.1s
Peak memory 3.4 GiB
Avg memory 3.4 GiB
CPU user 346.0s
CPU sys 71.2s
Peak spill 0 B

File an issue against this benchmark runner

Comment thread arrow-ipc/src/writer.rs
ipc_message: finished_data.to_vec(),
arrow_data,
})
if let Some(w) = writer {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to find another way of modeling this without falling into "if-driven-development" patterns.

Whenever you find yourself coding something switching execution branches with completely different bodies over booleans, it's an indicator that you are falling into an "if-driven-development" pattern, and this is will greatly hurt maintenance in the future.

One way of trying to improve this situation can be:

  1. Refactor things preferring code duplication, and split into different functions with different responsibilities, even if that implies copy-pasting big quantities of code.
  2. Once you have the different functions with a fair amount of copy-pasted code, try to see what are the common bits, and factor them out little by little, trying to progressively reduce the LOC count in each one.
  3. If you still see functions that need to accept bool or Option parameters that have the capability of switching how the function behaves, that means the function was the wrong abstraction on the first place, so don't be afraid of tearing existing functions apart if that allows you to reuse smaller bits in different places without switching on if statements.

Comment thread arrow-ipc/src/writer.rs
Comment on lines 1943 to 1954
arrow_data: &mut Vec<u8>,
encoded_buffers: &mut Vec<EncodedBuffer>,
nodes: &mut Vec<crate::FieldNode>,
offset: i64,
num_rows: usize,
null_count: usize,
compression_codec: Option<CompressionCodec>,
compression_context: &mut CompressionContext,
write_options: &IpcWriteOptions,
is_direct: bool,
) -> Result<i64, ArrowError> {
let mut offset = offset;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another example of my comment above, but this time reinforced by a big Clippy bypass at the top.

Ideally, we should not be contributing towards stronger Clippy violations. If you try to follow the steps above, there are good chances that we can either maintain the width of this function's signature, or even reduce it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize arrow-ipc

4 participants