Efficient distinct operator #80
…n element occurs multiple times in the input stream
I'll take a proper look later when I'm home, but it looks good. Does this dedupe per key? So to dedupe everything, you would key by a common key first?
Not a problem if it doesn't; we can always add a version that does later if we want to.
It dedupes by the entire value unless you pass an extractor function, in which case it dedupes by the return value of that function. The extractor function pattern should be general enough to handle any case. So, if you want to dedupe by key, you would do:
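A rough sketch of the extractor idea described in this comment, not the snippet from the original comment and not the library's operator: `distinctValues` is a hypothetical helper over a plain array, and comparing keys via `JSON.stringify` is an assumption made only for illustration.

```typescript
// Hypothetical helper for illustration only; not the library's operator.
// Keeps the first item seen for each distinct extracted key.
function distinctValues<T>(
  items: T[],
  by: (value: T) => unknown = (value) => value, // default: compare the whole value
): T[] {
  const seen = new Set<string>();
  const out: T[] = [];
  for (const item of items) {
    // Assumption: keys are compared by their JSON representation.
    const key = JSON.stringify(by(item));
    if (!seen.has(key)) {
      seen.add(key);
      out.push(item);
    }
  }
  return out;
}

const pairs: [string, number][] = [["a", 1], ["a", 2], ["b", 1], ["a", 1]];

// No extractor: only exact duplicates of the whole [key, value] pair collapse.
const byWholeValue = distinctValues(pairs);

// Extractor returning the key: one entry survives per key.
const byKey = distinctValues(pairs, ([key]) => key);
```

With these inputs, `byWholeValue` keeps `["a", 1]`, `["a", 2]`, and `["b", 1]`, while `byKey` keeps only `["a", 1]` and `["b", 1]`.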
samwillis
left a comment
Awesome!
* Replace distinct operator atop reduce by a dedicated more efficient distinct operator
* Add additional unit tests for distinct operator
* Fix bug where distinct wasn't properly summing the multiplicities if an element occurs multiple times in the input stream
* Extend distinct with optional argument to determine what to deduplicate by.
* Formatting
* Small change to test
* changeset

---------

Co-authored-by: Sam Willis <sam.willis@gmail.com>
This PR replaces the existing `distinct` operator, which was built on top of the `reduce` operator, with a more efficient implementation. The new `distinct` operator supports an optional argument, a function `(value: T) => any`, that determines what to deduplicate on. For example, we can do:

Note that the output has the same type as the input: the function is only used to determine what to deduplicate on, not to transform the values. This is very useful for ts/db, where we want distinct results based on the selected columns but still want the entire selected rows to come out of the stream.
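To make the semantics concrete, here is a minimal, non-incremental sketch of a distinct over a batch of `[value, multiplicity]` entries. It is not the implementation in this PR: the names `MultiSetEntry` and `distinctBy` are hypothetical, keys are compared via `JSON.stringify` as a simplifying assumption, and keeping the first value seen per key is an arbitrary choice. It illustrates two points from the description: multiplicities of repeated elements are summed before deciding whether an element is present, and the extractor only picks the comparison key while the full value is emitted.

```typescript
// Hypothetical types and names for illustration; not the library's actual API.
type MultiSetEntry<T> = [T, number]; // [value, multiplicity]

function distinctBy<T>(
  input: MultiSetEntry<T>[],
  by: (value: T) => unknown = (value) => value, // default: dedupe on the whole value
): MultiSetEntry<T>[] {
  // Sum multiplicities per extracted key, remembering the first value seen.
  const totals = new Map<string, { value: T; count: number }>();
  for (const [value, multiplicity] of input) {
    const key = JSON.stringify(by(value)); // assumption: JSON-comparable keys
    const entry = totals.get(key);
    if (entry) {
      entry.count += multiplicity;
    } else {
      totals.set(key, { value, count: multiplicity });
    }
  }

  // Emit each key whose total multiplicity is positive exactly once.
  const output: MultiSetEntry<T>[] = [];
  for (const { value, count } of totals.values()) {
    if (count > 0) {
      output.push([value, 1]);
    }
  }
  return output;
}

type Row = { country: string; name: string };

const rows: MultiSetEntry<Row>[] = [
  [{ country: "NL", name: "Ada" }, 1],
  [{ country: "NL", name: "Bob" }, 2],
  [{ country: "US", name: "Cy" }, 1],
  [{ country: "US", name: "Cy" }, -1], // retraction cancels the insert above
];

// Dedupe on the country column only; whole rows still come out.
const result = distinctBy(rows, (row) => row.country);
```

Here `result` contains a single entry, the full `{ country: "NL", name: "Ada" }` row with multiplicity 1: the `NL` rows sum to 3 and collapse to one output, while the `US` insert and retraction sum to 0 and produce nothing.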