perf: Add support for GroupsAccumulator to string_agg #21154

neilconway wants to merge 5 commits into apache:main

Conversation
```rust
delimiter: String,
/// Accumulated string per group. `None` means no values have been seen
/// (the group's output will be NULL).
values: Vec<Option<String>>,
```
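The per-group state above can be sketched as follows. This is a hypothetical, simplified model (the `StringAggGroups` name and `update` method are illustrative, not the PR's actual code) of how one `Option<String>` slot per group accumulates delimited values:

```rust
/// Simplified sketch, not the actual DataFusion implementation.
struct StringAggGroups {
    delimiter: String,
    /// Accumulated string per group; `None` means no non-NULL value seen yet.
    values: Vec<Option<String>>,
}

impl StringAggGroups {
    fn new(delimiter: &str) -> Self {
        Self {
            delimiter: delimiter.to_string(),
            values: Vec::new(),
        }
    }

    /// Fold one (possibly NULL) input value into a group's slot.
    fn update(&mut self, group_idx: usize, value: Option<&str>) {
        if self.values.len() <= group_idx {
            self.values.resize(group_idx + 1, None);
        }
        let Some(v) = value else {
            // NULL inputs are skipped; a group that only saw NULLs stays None,
            // so its output will be NULL.
            return;
        };
        match &mut self.values[group_idx] {
            Some(acc) => {
                acc.push_str(&self.delimiter);
                acc.push_str(v);
            }
            slot => *slot = Some(v.to_string()),
        }
    }
}

fn main() {
    let mut acc = StringAggGroups::new(";");
    acc.update(0, Some("a"));
    acc.update(1, None);
    acc.update(0, Some("b"));
    assert_eq!(acc.values[0].as_deref(), Some("a;b"));
    assert_eq!(acc.values[1], None);
}
```

The cost of this layout is one heap `String` per group, which is what the buffer-based alternatives discussed below try to avoid.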
Perhaps the values could first be collected into a single buffer and then assembled afterwards (from offsets / lengths).
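A minimal sketch of this single-buffer idea, with hypothetical names (`FlatStringAgg` is not from the codebase): values are appended to one contiguous buffer, each group records (offset, length) slices into it, and the delimited strings are only built at the end:

```rust
/// Hypothetical sketch, not the PR's code: all values live in one contiguous
/// `String`, and each group records (offset, len) slices into it.
struct FlatStringAgg {
    buffer: String,
    /// Per group: (offset, len) slices into `buffer`.
    groups: Vec<Vec<(usize, usize)>>,
}

impl FlatStringAgg {
    fn new() -> Self {
        Self {
            buffer: String::new(),
            groups: Vec::new(),
        }
    }

    fn update(&mut self, group_idx: usize, value: &str) {
        if self.groups.len() <= group_idx {
            self.groups.resize(group_idx + 1, Vec::new());
        }
        let offset = self.buffer.len();
        self.buffer.push_str(value);
        self.groups[group_idx].push((offset, value.len()));
    }

    /// Assemble a group's delimited result only at evaluation time.
    fn evaluate(&self, group_idx: usize, delimiter: &str) -> Option<String> {
        let parts = self.groups.get(group_idx)?;
        if parts.is_empty() {
            return None;
        }
        Some(
            parts
                .iter()
                .map(|&(off, len)| &self.buffer[off..off + len])
                .collect::<Vec<_>>()
                .join(delimiter),
        )
    }
}

fn main() {
    let mut agg = FlatStringAgg::new();
    agg.update(0, "foo");
    agg.update(1, "bar");
    agg.update(0, "baz");
    assert_eq!(agg.evaluate(0, ";"), Some("foo;baz".to_string()));
    assert_eq!(agg.evaluate(1, ";"), Some("bar".to_string()));
}
```

This avoids one allocation per group but, as the reply notes, complicates partial emission, since emitted groups' slices leave dead ranges in the shared buffer.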
Thanks for the suggestion! This could work, although it ends up making the partial-emit / space reclamation logic a lot more complicated.
If we're going to take on more complexity, we could go further and avoid copying the input string during update_batch; just bump the Arc refcount on the input batch and keep <group_id, batch_id, row_id> triples. Then assemble the actual results in evaluate() (this is similar to #20504 for array_agg). This would be quite a bit more complicated than this PR, but it could be worth it to reduce the amount of data being copied. I opened #21156 for this idea.
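A rough sketch of that deferred-assembly idea, with a plain `Vec<String>` standing in for an Arrow record batch column (all names here are hypothetical, not from this PR or #20504):

```rust
use std::sync::Arc;

/// Hypothetical sketch of deferring assembly to evaluation time.
struct DeferredStringAgg {
    /// Input batches kept alive by bumping the Arc refcount (no string copies).
    batches: Vec<Arc<Vec<String>>>,
    /// Per group: (batch_id, row_id) references into `batches`.
    groups: Vec<Vec<(usize, usize)>>,
}

impl DeferredStringAgg {
    fn new() -> Self {
        Self {
            batches: Vec::new(),
            groups: Vec::new(),
        }
    }

    /// Record which rows of this batch belong to which group; the batch
    /// itself is retained, not copied.
    fn update_batch(&mut self, batch: Arc<Vec<String>>, group_ids: &[usize]) {
        let batch_id = self.batches.len();
        self.batches.push(batch);
        for (row_id, &g) in group_ids.iter().enumerate() {
            if self.groups.len() <= g {
                self.groups.resize(g + 1, Vec::new());
            }
            self.groups[g].push((batch_id, row_id));
        }
    }

    /// The actual result strings are only assembled here.
    fn evaluate(&self, group_idx: usize, delimiter: &str) -> Option<String> {
        let refs = self.groups.get(group_idx)?;
        if refs.is_empty() {
            return None;
        }
        Some(
            refs.iter()
                .map(|&(b, r)| self.batches[b][r].as_str())
                .collect::<Vec<_>>()
                .join(delimiter),
        )
    }
}

fn main() {
    let mut agg = DeferredStringAgg::new();
    let batch = Arc::new(vec!["a".to_string(), "b".to_string(), "c".to_string()]);
    agg.update_batch(batch, &[0, 1, 0]);
    assert_eq!(agg.evaluate(0, ";"), Some("a;c".to_string()));
    assert_eq!(agg.evaluate(1, ";"), Some("b".to_string()));
}
```

The trade-off, raised in the thread, is that retaining whole batches can pin far more memory than the aggregated strings themselves need.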
> Then assemble the actual results in evaluate() (this is similar to #20504 for array_agg). This would be quite a bit more complicated than this PR
We also need to ensure we aren't keeping around too much "garbage" (memory) if we go with this approach as well.
Locally, I was also able to reproduce about a 50% speedup.

Create scale-factor 100 dataset:

```shell
tpchgen-cli --format parquet --scale-factor=100 --tables partsupp
```

main:

```
> select ps_partkey, string_agg(ps_comment, ';') from 'partsupp.parquet' group by ps_partkey;
20000000 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 10.798 seconds.
```

This branch:

```
andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ ./datafusion-cli-neilc_optimize-string-agg
DataFusion CLI v52.3.0
> select ps_partkey, string_agg(ps_comment, ';') from 'partsupp.parquet' group by ps_partkey;
...
20000000 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 6.600 seconds.
```
alamb left a comment:
This looks great to me @neilconway -- thank you. The groups accumulator you have created is really nice and easy to review.

While @Dandandan's idea to improve things further is a good one, given that this approach already improves performance, I think we can merge it as is.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?

Closes #17789: `string_agg` aggregate function is 1000x slower than duckdb (SQLStorm).

Rationale for this change

`string_agg` previously didn't support the `GroupsAccumulator` API; adding support for it can significantly improve performance, particularly when there are many groups.

Benchmarks (M4 Max):
What changes are included in this PR?

- Implement the `GroupsAccumulator` API for `string_agg`
- Refactor existing `string_agg` code paths

Are these changes tested?
Yes.
Are there any user-facing changes?
No, other than a change to an error message string.