Skip to content

perf: Add support for GroupsAccumulator to string_agg#21154

Open
neilconway wants to merge 5 commits intoapache:mainfrom
neilconway:neilc/optimize-string-agg
Open

perf: Add support for GroupsAccumulator to string_agg#21154
neilconway wants to merge 5 commits intoapache:mainfrom
neilconway:neilc/optimize-string-agg

Conversation

@neilconway
Copy link
Contributor

@neilconway neilconway commented Mar 25, 2026

Which issue does this PR close?

Rationale for this change

string_agg previously didn't support the GroupsAccumulator API; adding support for it can significantly improve performance, particularly when there are many groups.

Benchmarks (M4 Max):

  • string_agg_query_group_by_few_groups (~10): 645 µs → 564 µs, -11%
  • string_agg_query_group_by_mid_groups (~1,000): 2,692 µs → 871 µs, -68%
  • string_agg_query_group_by_many_groups (~65,000): 16,606 µs → 1,147 µs, -93%

What changes are included in this PR?

  • Add end-to-end benchmark for string_agg
  • Implement GroupsAccumulator API for string_agg
  • Add unit tests
  • Minor code cleanup for existing string_agg code paths

Are these changes tested?

Yes.

Are there any user-facing changes?

No, other than a change to an error message string.

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Mar 25, 2026
delimiter: String,
/// Accumulated string per group. `None` means no values have been seen
/// (the group's output will be NULL).
values: Vec<Option<String>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps he values can be collected first to a single buffer first and collected afterwards (from offsets / lengths).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! This could work, although it ends up making the partial-emit / space reclamation logic a lot more complicated.

If we're going to take on more complexity, we could go further and avoid copying the input string during update_batch; just bump the Arc refcount on the input batch and keep <group_id, batch_id, row_id> triples. Then assemble the actual results in evaluate() (this is similar to #20504 for array_agg). This would be quite a bit more complicated than this PR, but it could be worth it to reduce the amount of data being copied. I opened #21156 for this idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then assemble the actual results in evaluate() (this is similar to #20504 for array_agg). This would be quite a bit more complicated than this PR,

We also need to ensure we aren't keeping around too much "garbage" (memory) if we go this approach as well

@alamb
Copy link
Contributor

alamb commented Mar 25, 2026

Locally, I was also able to reproduce about a 50% speedup

Create 100 scale dataset

tpchgen-cli --format parquet --scale-factor=100 --tables partsupp

main:

> select ps_partkey, string_agg(ps_comment, ';') from 'partsupp.parquet' group by ps_partkey;

20000000 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 10.798 seconds.

This branch

andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ ./datafusion-cli-neilc_optimize-string-agg
DataFusion CLI v52.3.0
> select ps_partkey, string_agg(ps_comment, ';') from 'partsupp.parquet' group by ps_partkey;
...
20000000 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 6.600 seconds.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me @neilconway -- thank you. The groups accumulator you have created is really nice and easy to review

While @Dandandan 's idea to improve things more is a good one, I think given this approach improves performance already we can merge it as is

delimiter: String,
/// Accumulated string per group. `None` means no values have been seen
/// (the group's output will be NULL).
values: Vec<Option<String>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then assemble the actual results in evaluate() (this is similar to #20504 for array_agg). This would be quite a bit more complicated than this PR,

We also need to ensure we aren't keeping around too much "garbage" (memory) if we go this approach as well

@alamb alamb added the performance Make DataFusion faster label Mar 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate functions Changes to functions implementation performance Make DataFusion faster sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

string_agg aggregate function is 1000x slower than duckdb (SQLStorm)

3 participants