fix: remove package fuzzyset in favor of fuzzystrmatch-pg#4837
fix: remove package fuzzyset in favor of fuzzystrmatch-pg#4837taimoorzaeem wants to merge 1 commit intoPostgREST:mainfrom
Conversation
|
The slight performance drop in postgrest/test/load/errors.http Lines 9 to 11 in de4ad62 |
WIP Signed-off-by: Taimoor Zaeem <taimoorzaeem@gmail.com>
f4ca00a to
0ff70ab
Compare
Hm, looks like perf loss is ~26% while we also use more CPU and MEM.
🤔 Could you summarize why? Would it be possible to generate a hint with the current implementation to compare more accurately? |
| -- | | ||
| -- Check Levenshtein Distance and return hint lower than max distance | ||
| checkLevenshteinDistance :: Text -> [Text] -> Int -> Maybe Text | ||
| checkLevenshteinDistance _ [] _ = Nothing | ||
| checkLevenshteinDistance identName (suggest:suggests) dist = | ||
| case Fuzzy.levenshteinLessEqual identName suggest dist of | ||
| Just _ -> Just suggest | ||
| Nothing -> checkLevenshteinDistance identName suggests dist | ||
|
|
||
| -- | | ||
| -- Check Levenshtein Distance and return hint with minimum distance | ||
| checkMinimumLevenshteinDistance :: Text -> [Text] -> Maybe Text -> Int -> Maybe Text | ||
| checkMinimumLevenshteinDistance _ [] Nothing _ = Nothing | ||
| checkMinimumLevenshteinDistance _ [] (Just suggest) _ = Just suggest | ||
| checkMinimumLevenshteinDistance identName (suggest:suggests) currentSuggest minDist = | ||
| let dist = Fuzzy.levenshtein identName suggest | ||
| in if dist < minDist | ||
| then | ||
| if dist == 0 && hintType == HintRelChildren -- Do not give suggestion if the child is found in the relations (dist = 0) | ||
| then checkMinimumLevenshteinDistance identName suggests currentSuggest minDist -- Go with current suggestion | ||
| else checkMinimumLevenshteinDistance identName suggests (Just suggest) dist -- Update suggestion | ||
| else checkMinimumLevenshteinDistance identName suggests currentSuggest minDist -- Go with current suggestion |
There was a problem hiding this comment.
Could the library somehow include this logic? It seems a bit long and related to the algorithm.
Yeah, I think first off all we should modify this target request to generate error with hint, the comparison would be more clear after that. In hindsight, I think it used to generate the hint, but after #4684, it doesn't and went unnoticed. |
|
I'm not sure I understand why this is created as a separate library, if its only use is for PostgREST and it will have to be maintained by the team anyway. Why not have the code in this repo directly, if we're not using an existing library? |
| | NoRpc Text Text [Text] MediaType Bool [QualifiedIdentifier] [Routine] | ||
| | ColumnNotFound Text Text | ||
| | TableNotFound Text Text SchemaCache | ||
| | TableNotFound Text Text [Table] |
There was a problem hiding this comment.
See our discussion here:
https://github.com/PostgREST/postgrest/pull/4472/changes#r2596118601
We decided to have SchemaCache here to avoid constant back-and-forth API changes every time the hint calculation algorithm changes (which this edit confirms is a problem :) ).
I'd leave SchemaCache here. There would be no changes in Response and Plan needed - less code churn.
Code in @taimoorzaeem - I understand the algorithm to find a hint is linear in the size of the tables list? I suspect there is something off with our tests right now and IMHO we need to make sure we test this functionality correctly before doing any changes to the implementation. |
Well... @taimoorzaeem decided to implement a library that he wishes to maintain as an OSS project (which he's a sole copyright owner from what I read in the LICENSE file). I don't think that's any different from any other third-party dependency we have (other than @taimoorzaeem is also a contributor to PostgREST). |
|
I opened this PR to compare with the loadtest results. Obviously it's in no shape for merging as yet. I recently found out about more ways to optimize the algorithm using BK-trees which I might try. I usually try to work on this library slowly in separate time.
I think this much code would be very invasive in PostgREST source and that it should be maintained in a separate repo. Maybe in the future we can add more algorithms from https://www.postgresql.org/docs/current/fuzzystrmatch.html. I also think that as a package it might be useful for others in the open source community (although I am skeptical that happens anytime soon).
How about we move this repo under |
Well, moving from an unmaintained library to a just released new library with a single contributor and no track record of ongoing maintenance (no offense, Taimoor!) does not look like the right move from the project's perspective. But that's not my primary concern. My primary concern is, that we have other such libraries already in place:
We already have overhead and friction, because we need to keep these packages building alongside our own, but can't move as freely as we could otherwise. From practical experience, I don't see any advantage by having
To be clear, my primary concern is not who is maintaining it, but whether it's mono-repo or a separate package. |
I agree.
I would go in the opposite direction - reuse established configuration language and libraries from the market instead of re-inventing something PostgREST specific (see also below).
The flip side being the growing amount of non PostgREST specific code needed to maintain in PostgREST. Including some generic libraries (like Levenshtein distance implementation or sieve cache) in PostgREST incurs long term costs and risks. I'm afraid it is trade-offs all the way down - there is a natural tension here, especially in case of |
Sure, but this argument only works when other, well maintained options are available. Which was clearly not the case for The point is: The amount of code to maintain for us is the same, if we're effectively the only users of these libraries. So we might as well make it easier for us to maintain them. But yes, I agree: If we can use a well maintained and well used library from hackage, that's even better.
This has been suggested before, but we were reluctant to make that breaking change, so far. |
Discussed on #3101 (comment). |
WIP. Remove package fuzzyset in favor of fuzzystrmatch-pg. Closes #4686.
fuzzysetand replace withfuzzystrmatch-pg