perf: Add support for `GroupsAccumulator` to `string_agg` by neilconway · Pull Request #21154 · apache/datafusion

neilconway · 2026-03-25T17:24:05Z

Which issue does this PR close?

Closes #17789 (string_agg aggregate function is 1000x slower than duckdb (SQLStorm)).

Rationale for this change

string_agg previously didn't support the GroupsAccumulator API; adding support for it can significantly improve performance, particularly when there are many groups.

Benchmarks (M4 Max):

string_agg_query_group_by_few_groups (~10): 645 µs → 564 µs, -11%
string_agg_query_group_by_mid_groups (~1,000): 2,692 µs → 871 µs, -68%
string_agg_query_group_by_many_groups (~65,000): 16,606 µs → 1,147 µs, -93%

What changes are included in this PR?

Add end-to-end benchmark for string_agg
Implement GroupsAccumulator API for string_agg
Add unit tests
Minor code cleanup for existing string_agg code paths
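
As a rough illustration of the per-group state described above, here is a minimal std-only sketch. It is not the PR's code: `StringAggGroups`, `emit_first`, and the slice-based `update_batch` signature are hypothetical simplifications (the real `GroupsAccumulator` trait works on Arrow arrays and also handles filters and state merging). Only the `delimiter` and `values: Vec<Option<String>>` fields mirror the diff under review.

```rust
/// Simplified stand-in for per-group string_agg state (hypothetical type,
/// not the DataFusion implementation).
struct StringAggGroups {
    delimiter: String,
    /// One slot per group; `None` means no non-NULL value has been seen.
    values: Vec<Option<String>>,
}

impl StringAggGroups {
    fn new(delimiter: &str) -> Self {
        Self { delimiter: delimiter.to_string(), values: Vec::new() }
    }

    /// Process one batch: `group_indices[i]` is the group that `inputs[i]` belongs to.
    fn update_batch(
        &mut self,
        inputs: &[Option<&str>],
        group_indices: &[usize],
        total_num_groups: usize,
    ) {
        assert_eq!(inputs.len(), group_indices.len());
        if self.values.len() < total_num_groups {
            self.values.resize(total_num_groups, None);
        }
        for (&input, &group) in inputs.iter().zip(group_indices) {
            // NULL inputs do not contribute to the result.
            let Some(s) = input else { continue };
            match &mut self.values[group] {
                Some(acc) => {
                    acc.push_str(&self.delimiter);
                    acc.push_str(s);
                }
                None => self.values[group] = Some(s.to_string()),
            }
        }
    }

    /// Emit the first `n` groups and keep the rest, shifting them down to index 0.
    fn emit_first(&mut self, n: usize) -> Vec<Option<String>> {
        self.values.drain(..n).collect()
    }

    /// Emit all groups and reset the state.
    fn evaluate(&mut self) -> Vec<Option<String>> {
        std::mem::take(&mut self.values)
    }
}
```

Because every group lives in one vector indexed by group id, a batch is processed in a single loop rather than by dispatching to a separate accumulator object per group, which is plausibly where the gains at high group counts come from.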

Are these changes tested?

Yes.

Are there any user-facing changes?

No, other than a change to an error message string.

Dandandan · 2026-03-25T17:55:59Z

datafusion/functions-aggregate/src/string_agg.rs

+    delimiter: String,
+    /// Accumulated string per group. `None` means no values have been seen
+    /// (the group's output will be NULL).
+    values: Vec<Option<String>>,


Perhaps the values can first be collected into a single buffer and then assembled afterwards (from offsets / lengths).
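
One possible reading of this suggestion, sketched with std-only Rust. The type `BufferedStringAgg` and its methods are hypothetical names chosen for illustration, and the layout (one shared `String` plus `(group, offset, length)` spans) is an assumption about what was meant, not code from the PR.

```rust
/// Hypothetical sketch: append every input to one shared buffer and
/// remember where each piece lives, then build per-group strings at the end.
struct BufferedStringAgg {
    delimiter: String,
    /// All non-NULL input strings, concatenated in arrival order.
    buffer: String,
    /// (group index, byte offset into `buffer`, byte length)
    spans: Vec<(usize, usize, usize)>,
}

impl BufferedStringAgg {
    fn new(delimiter: &str) -> Self {
        Self { delimiter: delimiter.to_string(), buffer: String::new(), spans: Vec::new() }
    }

    fn update_batch(&mut self, inputs: &[Option<&str>], group_indices: &[usize]) {
        assert_eq!(inputs.len(), group_indices.len());
        for (&input, &group) in inputs.iter().zip(group_indices) {
            let Some(s) = input else { continue };
            self.spans.push((group, self.buffer.len(), s.len()));
            self.buffer.push_str(s);
        }
    }

    /// Assemble the final string for each group from the recorded spans.
    fn evaluate(&self, total_num_groups: usize) -> Vec<Option<String>> {
        let mut out: Vec<Option<String>> = vec![None; total_num_groups];
        for &(group, offset, len) in &self.spans {
            // Offsets always fall on char boundaries because whole strings were appended.
            let piece = &self.buffer[offset..offset + len];
            match &mut out[group] {
                Some(acc) => {
                    acc.push_str(&self.delimiter);
                    acc.push_str(piece);
                }
                None => out[group] = Some(piece.to_string()),
            }
        }
        out
    }
}
```

In this sketch `update_batch` performs no per-group allocation, but emitting only some groups would require compacting `buffer` and rewriting the remaining spans.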

Thanks for the suggestion! This could work, although it ends up making the partial-emit / space reclamation logic a lot more complicated.

If we're going to take on more complexity, we could go further and avoid copying the input string during update_batch; just bump the Arc refcount on the input batch and keep <group_id, batch_id, row_id> triples. Then assemble the actual results in evaluate() (this is similar to #20504 for array_agg). This would be quite a bit more complicated than this PR, but it could be worth it to reduce the amount of data being copied. I opened #21156 for this idea.
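
The deferred-assembly idea could look roughly like the following std-only sketch. Everything here is hypothetical: `Batch` is an `Arc<Vec<Option<String>>>` standing in for a reference-counted Arrow string array, and `DeferredStringAgg` is an illustrative name rather than anything in DataFusion.

```rust
use std::sync::Arc;

/// Stand-in for a shared, immutable input batch (assumption: the real code
/// would hold a reference-counted Arrow array instead).
type Batch = Arc<Vec<Option<String>>>;

struct DeferredStringAgg {
    delimiter: String,
    /// Input batches kept alive by reference count; no string data is copied.
    batches: Vec<Batch>,
    /// (group_id, batch_id, row_id) for every non-NULL input row.
    rows: Vec<(usize, usize, usize)>,
}

impl DeferredStringAgg {
    fn new(delimiter: &str) -> Self {
        Self { delimiter: delimiter.to_string(), batches: Vec::new(), rows: Vec::new() }
    }

    fn update_batch(&mut self, batch: &Batch, group_indices: &[usize]) {
        assert_eq!(batch.len(), group_indices.len());
        let batch_id = self.batches.len();
        // Only the refcount is bumped here; the strings stay where they are.
        self.batches.push(Arc::clone(batch));
        for (row_id, (value, &group_id)) in batch.iter().zip(group_indices).enumerate() {
            if value.is_some() {
                self.rows.push((group_id, batch_id, row_id));
            }
        }
    }

    /// Copy each string exactly once, while assembling the final results.
    fn evaluate(&self, total_num_groups: usize) -> Vec<Option<String>> {
        let mut out: Vec<Option<String>> = vec![None; total_num_groups];
        for &(group_id, batch_id, row_id) in &self.rows {
            let Some(piece) = self.batches[batch_id][row_id].as_deref() else { continue };
            match &mut out[group_id] {
                Some(acc) => {
                    acc.push_str(&self.delimiter);
                    acc.push_str(piece);
                }
                None => out[group_id] = Some(piece.to_string()),
            }
        }
        out
    }
}
```

Note that holding an `Arc` keeps the entire batch alive, including rows that are NULL or that belong to groups that have already been emitted.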

alamb · 2026-03-25T21:41:59Z

> Then assemble the actual results in evaluate() (this is similar to #20504 for array_agg). This would be quite a bit more complicated than this PR,

We also need to ensure we aren't keeping around too much "garbage" (memory) if we take this approach as well.

alamb · 2026-03-25T21:39:44Z

Locally, I was also able to reproduce about a 50% speedup

Create 100 scale dataset

tpchgen-cli --format parquet --scale-factor=100 --tables partsupp

main:

> select ps_partkey, string_agg(ps_comment, ';') from 'partsupp.parquet' group by ps_partkey;

20000000 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 10.798 seconds.

This branch

andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ ./datafusion-cli-neilc_optimize-string-agg
DataFusion CLI v52.3.0

> select ps_partkey, string_agg(ps_comment, ';') from 'partsupp.parquet' group by ps_partkey;
...
20000000 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 6.600 seconds.
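
One way to quantify the two timings above is shown below. The helper names `percent_change` and `speedup` are hypothetical; the inputs are the elapsed times reported by the two runs. With these numbers the elapsed time drops by roughly 39%, which corresponds to about a 1.64x speedup.

```rust
/// Relative change in elapsed time, in percent (negative means faster).
fn percent_change(before_secs: f64, after_secs: f64) -> f64 {
    (after_secs - before_secs) / before_secs * 100.0
}

/// Speedup factor: how many times faster the new run is than the old one.
fn speedup(before_secs: f64, after_secs: f64) -> f64 {
    before_secs / after_secs
}
```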

alamb

This looks great to me @neilconway -- thank you. The groups accumulator you have created is really nice and easy to review.

While @Dandandan's idea to improve things more is a good one, I think given this approach improves performance already we can merge it as is.

.

2e20c1a

github-actions bot added the core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), and functions (Changes to functions implementation) labels Mar 25, 2026

Fix doc link error

d1baf23

Dandandan reviewed Mar 25, 2026


alamb approved these changes Mar 25, 2026


alamb added the performance (Make DataFusion faster) label Mar 25, 2026

neilconway and others added 3 commits March 25, 2026 18:05

Update datafusion/functions-aggregate/src/string_agg.rs

26f522f

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

cargo fmt

713aa19

Merge remote-tracking branch 'origin/main' into neilc/optimize-string…

94f23bd

…-agg

neilconway wants to merge 5 commits into apache:main from neilconway:neilc/optimize-string-agg
