Skip to content

fix: remove package fuzzyset in favor of fuzzystrmatch-pg#4837

Draft
taimoorzaeem wants to merge 1 commit intoPostgREST:mainfrom
taimoorzaeem:remove/fuzzyset
Draft

fix: remove package fuzzyset in favor of fuzzystrmatch-pg#4837
taimoorzaeem wants to merge 1 commit intoPostgREST:mainfrom
taimoorzaeem:remove/fuzzyset

Conversation

@taimoorzaeem
Copy link
Copy Markdown
Member

@taimoorzaeem taimoorzaeem commented Apr 22, 2026

WIP. Remove package fuzzyset in favor of fuzzystrmatch-pg. Closes #4686.

  • Remove fuzzyset and replace with fuzzystrmatch-pg
  • Update/add tests
  • Compare loadtest results
  • Update docs (if any)
  • Add changelog (add details if breaking)

@taimoorzaeem taimoorzaeem marked this pull request as draft April 22, 2026 09:11
@taimoorzaeem
Copy link
Copy Markdown
Member Author

The slight performance drop in loadtest(errors) happened due to this request, which previously didn't generate hint, but now it does.

# Misspelled function names
GET http://postgrest/rpc/call_em_x?name=John
Prefer: tx=commit

WIP

Signed-off-by: Taimoor Zaeem <taimoorzaeem@gmail.com>
@steve-chavez
Copy link
Copy Markdown
Member

rate 3966.4539316152445 2903.213981532442 3839.1671719650526
https://github.com/PostgREST/postgrest/actions/runs/24824073137?pr=4837#summary-72655504410

Hm, looks like perf loss is ~26% while we also use more CPU and MEM.

which previously didn't generate hint, but now it does.

🤔 Could you summarize why? Would it be possible to generate a hint with the current implementation to compare more accurately?

Comment thread src/PostgREST/Error.hs
Comment on lines +413 to +434
-- |
-- Check Levenshtein Distance and return hint lower than max distance
checkLevenshteinDistance :: Text -> [Text] -> Int -> Maybe Text
checkLevenshteinDistance _ [] _ = Nothing
checkLevenshteinDistance identName (suggest:suggests) dist =
case Fuzzy.levenshteinLessEqual identName suggest dist of
Just _ -> Just suggest
Nothing -> checkLevenshteinDistance identName suggests dist

-- |
-- Check Levenshtein Distance and return hint with minimum distance
checkMinimumLevenshteinDistance :: Text -> [Text] -> Maybe Text -> Int -> Maybe Text
checkMinimumLevenshteinDistance _ [] Nothing _ = Nothing
checkMinimumLevenshteinDistance _ [] (Just suggest) _ = Just suggest
checkMinimumLevenshteinDistance identName (suggest:suggests) currentSuggest minDist =
let dist = Fuzzy.levenshtein identName suggest
in if dist < minDist
then
if dist == 0 && hintType == HintRelChildren -- Do not give suggestion if the child is found in the relations (dist = 0)
then checkMinimumLevenshteinDistance identName suggests currentSuggest minDist -- Go with current suggestion
else checkMinimumLevenshteinDistance identName suggests (Just suggest) dist -- Update suggestion
else checkMinimumLevenshteinDistance identName suggests currentSuggest minDist -- Go with current suggestion
Copy link
Copy Markdown
Member

@steve-chavez steve-chavez Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the library somehow include this logic? It seems a bit long and related to the algorithm.

@taimoorzaeem
Copy link
Copy Markdown
Member Author

🤔 Could you summarize why? Would it be possible to generate a hint with the current implementation to compare more accurately?

Yeah, I think first off all we should modify this target request to generate error with hint, the comparison would be more clear after that.

In hindsight, I think it used to generate the hint, but after #4684, it doesn't and went unnoticed.

@wolfgangwalther
Copy link
Copy Markdown
Member

I'm not sure I understand why this is created as a separate library, if its only use is for PostgREST and it will have to be maintained by the team anyway. Why not have the code in this repo directly, if we're not using an existing library?

| NoRpc Text Text [Text] MediaType Bool [QualifiedIdentifier] [Routine]
| ColumnNotFound Text Text
| TableNotFound Text Text SchemaCache
| TableNotFound Text Text [Table]
Copy link
Copy Markdown
Collaborator

@mkleczek mkleczek Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See our discussion here:
https://github.com/PostgREST/postgrest/pull/4472/changes#r2596118601

We decided to have SchemaCache here to avoid constant back-and-forth API changes every time the hint calculation algorithm changes (which this edit confirms is a problem :) ).

I'd leave SchemaCache here. There would be no changes in Response and Plan needed - less code churn.

@mkleczek
Copy link
Copy Markdown
Collaborator

mkleczek commented Apr 24, 2026

rate 3966.4539316152445 2903.213981532442 3839.1671719650526
https://github.com/PostgREST/postgrest/actions/runs/24824073137?pr=4837#summary-72655504410

Hm, looks like perf loss is ~26% while we also use more CPU and MEM.

which previously didn't generate hint, but now it does.

🤔 Could you summarize why? Would it be possible to generate a hint with the current implementation to compare more accurately?

Code in main does not calculate any hint for schemas with more than 500 tables (we don't populate the FuzzySet in these cases). Code in errors.sql creates 20000 tables. Code that doesn't do anything cannot be slower than code that linearly processes 20000 entries...

@taimoorzaeem - I understand the algorithm to find a hint is linear in the size of the tables list?

I suspect there is something off with our tests right now and IMHO we need to make sure we test this functionality correctly before doing any changes to the implementation.

@mkleczek
Copy link
Copy Markdown
Collaborator

mkleczek commented Apr 24, 2026

I'm not sure I understand why this is created as a separate library, if its only use is for PostgREST and it will have to be maintained by the team anyway. Why not have the code in this repo directly, if we're not using an existing library?

Well... @taimoorzaeem decided to implement a library that he wishes to maintain as an OSS project (which he's a sole copyright owner from what I read in the LICENSE file).
He also thinks this library is a better fit for PostgREST than the one we're using right now.

I don't think that's any different from any other third-party dependency we have (other than @taimoorzaeem is also a contributor to PostgREST).

@taimoorzaeem
Copy link
Copy Markdown
Member Author

I opened this PR to compare with the loadtest results. Obviously it's in no shape for merging as yet. I recently found out about more ways to optimize the algorithm using BK-trees which I might try. I usually try to work on this library slowly in separate time.

I'm not sure I understand why this is created as a separate library, [..] Why not have the code in this repo directly, if we're not using an existing library?

I think this much code would be very invasive in PostgREST source and that it should be maintained in a separate repo. Maybe in the future we can add more algorithms from https://www.postgresql.org/docs/current/fuzzystrmatch.html. I also think that as a package it might be useful for others in the open source community (although I am skeptical that happens anytime soon).

if its only use is for PostgREST and it will have to be maintained by the team anyway.

How about we move this repo under PostgREST/ organization later after the implementation gets merged? That way the whole team can access it.

@wolfgangwalther
Copy link
Copy Markdown
Member

Well... @taimoorzaeem decided to implement a library that he wishes to maintain as an OSS project (which he's a sole copyright owner from what I read in the LICENSE file). He also thinks this library is a better fit for PostgREST than the one we're using right now.

I don't think that's any different from any other third-party dependency we have (other than @taimoorzaeem is also a contributor to PostgREST).

Well, moving from an unmaintained library to a just released new library with a single contributor and no track record of ongoing maintenance (no offense, Taimoor!) does not look like the right move from the project's perspective.

But that's not my primary concern. My primary concern is, that we have other such libraries already in place:

  • configurator-pg, which was eventually moved to the PostgREST org
  • hasql-notifications, which has taken a tiny bit of traction I believe, because it is used by one other project (ihp) and not just PostgREST, but was originally developed for PostgREST, too (I might be wrong, I'm judging by the author of the lib actually :D)

We already have overhead and friction, because we need to keep these packages building alongside our own, but can't move as freely as we could otherwise.

From practical experience, I don't see any advantage by having configurator-pg as a separate package, it actually just limits our ability to iterate on config-related work faster.

How about we move this repo under PostgREST/ organization later after the implementation gets merged? That way the whole team can access it.

To be clear, my primary concern is not who is maintaining it, but whether it's mono-repo or a separate package.

@mkleczek
Copy link
Copy Markdown
Collaborator

I don't think that's any different from any other third-party dependency we have (other than @taimoorzaeem is also a contributor to PostgREST).

Well, moving from an unmaintained library to a just released new library with a single contributor and no track record of ongoing maintenance (no offense, Taimoor!) does not look like the right move from the project's perspective.

I agree.

From practical experience, I don't see any advantage by having configurator-pg as a separate package, it actually just limits our ability to iterate on config-related work faster.

I would go in the opposite direction - reuse established configuration language and libraries from the market instead of re-inventing something PostgREST specific (see also below).

To be clear, my primary concern is not who is maintaining it, but whether it's mono-repo or a separate package.

The flip side being the growing amount of non PostgREST specific code needed to maintain in PostgREST. Including some generic libraries (like Levenshtein distance implementation or sieve cache) in PostgREST incurs long term costs and risks.

I'm afraid it is trade-offs all the way down - there is a natural tension here, especially in case of Haskell with its weak library ecosystem comparing to, let's say, Java. It needs good judgment on a case by case basis. PostgREST being one of the most prolific Haskell software is a good driver for new libraries and it would be a shame if all the code we implement is PostgREST specific.

@wolfgangwalther
Copy link
Copy Markdown
Member

The flip side being the growing amount of non PostgREST specific code needed to maintain in PostgREST.

Sure, but this argument only works when other, well maintained options are available. Which was clearly not the case for fuzzyset. And was not the case for configurator-ng, so now we have configurator-pg.

The point is: The amount of code to maintain for us is the same, if we're effectively the only users of these libraries. So we might as well make it easier for us to maintain them.

But yes, I agree: If we can use a well maintained and well used library from hackage, that's even better.

I would go in the opposite direction - reuse established configuration language and libraries from the market instead of re-inventing something PostgREST specific (see also below).

This has been suggested before, but we were reluctant to make that breaking change, so far.

@steve-chavez
Copy link
Copy Markdown
Member

I would go in the opposite direction - reuse established configuration language and libraries from the market instead of re-inventing something PostgREST specific (see also below).

This has been suggested before, but we were reluctant to make that breaking change, so far.

Discussed on #3101 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

Consider moving away from fuzzyset to another package for error hints

4 participants