
fix: Limit concurrent schema cache loads#4643

Open
mkleczek wants to merge 2 commits into PostgREST:main from
mkleczek:use-advisory-locks-to-throttle-schema-cache-loading

Conversation

@mkleczek
Collaborator

@mkleczek mkleczek commented Feb 10, 2026

DISCLAIMER:
This commit was authored entirely by a human without the assistance of LLMs.

Triggering a schema cache reload immediately upon receipt of a notification by the listener leads to a thundering herd problem in a PostgREST cluster.

This change limits the number of concurrent schema cache loading queries using advisory locks.

Fixes #4642

@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 2 times, most recently from 92cf79e to b63457e Compare February 10, 2026 21:51
Comment thread src/PostgREST/AppState.hs Outdated
-- Allow 10 concurrent schema cache loads, guarded by advisory locks.
-- This is to prevent thundering herd problem on startup or when many PostgREST instances receive "reload schema" notifications at the same time
lockId <- getRandomR (50168275::Int64, 50168275 + 10)
let stmt = SQL.Statement "SELECT pg_catalog.pg_advisory_xact_lock($1)" (HE.param $ HE.nonNullable HE.int8) HD.noResult configDbPreparedStatements
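For illustration, the scheme in the quoted snippet can be sketched outside Haskell. This Python sketch mirrors the constants above (the helper names are hypothetical); note that getRandomR bounds are inclusive, so the range actually covers 11 distinct lock ids:

```python
import random

# Constants mirrored from the quoted Haskell snippet.
LOCK_ID_BASE = 50168275
LOCK_SLOTS = 10  # inclusive upper bound => 11 distinct ids in practice

def pick_lock_id(rng: random.Random) -> int:
    # Haskell's getRandomR uses inclusive bounds; randint matches that.
    return rng.randint(LOCK_ID_BASE, LOCK_ID_BASE + LOCK_SLOTS)

def advisory_lock_sql() -> str:
    # The statement the snippet prepares; pg_advisory_xact_lock blocks
    # until the lock is free and releases it automatically at tx end.
    return "SELECT pg_catalog.pg_advisory_xact_lock($1)"

# Every possible draw falls in the same fixed window, so at most 11
# schema cache loads can run concurrently across all instances.
assert all(
    LOCK_ID_BASE <= pick_lock_id(random.Random(s)) <= LOCK_ID_BASE + LOCK_SLOTS
    for s in range(100)
)
```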
Member

These locks would be released automatically at the end of the transaction right? It does look like it would work for #4642.

I guess one drawback is that these advisory locks would run and leave a log trace even if the user will never run into #4642, which is the case for most users.

WDYT of the solution on #4642 (comment)? Would that be preferable?

Member

> I guess one drawback

Also, it's a bit more operational overhead: we would also have to recommend setting lock_timeout in addition to statement_timeout to avoid waiting for too long (like on a schema cache load that takes too long due to pg_catalog bloat)?

Collaborator Author

> I guess one drawback
>
> Also that it's a bit more operational overhead, we would also have to recommend setting lock_timeout in addition to statement_timeout to avoid waiting for too long? (like on a schema cache load that takes too long due to pg_catalog bloat)

Is it really an issue? If we get lock timeout we are going to retry anyway.

Collaborator Author

@mkleczek mkleczek Feb 11, 2026

> These locks would be released automatically at the end of the transaction right? It does look like it would work for #4642.

Yeah, they are tx scoped.

> I guess one drawback is that these advisory locks would run and leave a log trace even if the user will never run into #4642, which are most cases.

Maybe we should introduce a config property that activates it then?

> WDYT of the solution on #4642 (comment)? Would that be preferable?

See #4642 (comment)

I think for now mitigation of the thundering herd problem is way more feasible. At least in the short to medium term.

Member

> I think for now mitigation of thundering herd problem is way more feasible. At least in the short to medium term.
>
> Maybe we should introduce a config property that activates it then?

Yes, agree... a config sounds good. Should we parametrize the number of locks? Or should we just hardcode it to 10 and expose a boolean config? (Also, how do we know 10 is the right number?)

We would also need to test this; it seems doable to ensure only 10 connections can exist at a time when, say, 20 PostgREST instances with db-pool=1 and a PGRST_INTERNAL_SCHEMA_CACHE_QUERY_SLEEP are spawned.

Collaborator Author

@mkleczek mkleczek Feb 20, 2026

> Agree, I think 1 is a good number. Given that I think we should avoid a config and just set it.

Hmm... but why? What harm is there in allowing some level of concurrency? Especially since right now there is no limit at all?

@steve-chavez @wolfgangwalther
I think we can actually do better.

How about we adjust the number of locks based on the (estimated) number of nodes connected to the same database?
There are two issues to solve:

  1. How do we estimate the number of cluster nodes?
  2. What should be the algorithm to calculate the number of locks?

**Node number estimation**

The idea is to estimate it based on:

  • the number of active db sessions opened by the same user as session_user
  • the number of open connections in the pool

The estimate would be: active_sessions_number / connections_in_the_pool

This assumes the load is spread evenly among cluster members, so all nodes should have the same number of open connections.

**Number of locks calculation**

We need a sublinear function, and it seems to me the logarithm is a good fit. The number of locks would be `round(log2(estimated_number_of_nodes))`.

That way we can allow concurrent schema loads while protecting against the thundering herd issue in large clusters.

I've committed an implementation of this idea for you to review. If you don't like it we can easily delete the commit. If you think it is OK, we can split it into coherent pieces.
I've added a test that verifies the level of concurrency for various cluster sizes, and the results are as follows:

| Nodes | Locks |
|-------|-------|
| 2     | 2     |
| 4     | 3     |
| 6     | 4     |
| 8     | 4     |
| 16    | 5     |
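For illustration, here is a quick numeric sketch of the proposal (Python, illustrative only; the PR itself is Haskell). Note that the prose above says `round(log2(n))` while the posted results are consistent with `round(log2(n)) + 1`, so the `+ 1` below is inferred from the table rather than stated in the thread:

```python
import math

def estimated_nodes(active_sessions: int, pool_connections: int) -> int:
    # active_sessions: db sessions open for the same session_user;
    # pool_connections: this node's open pool connections.
    # Assumes load is spread evenly, so every node holds the same count.
    return max(1, active_sessions // pool_connections)

def lock_count(nodes: int) -> int:
    # Sublinear growth via log2. The "+ 1" is inferred from the results
    # table in this thread (plain round(log2(2)) would give only 1 lock).
    return round(math.log2(nodes)) + 1

# Reproduces the Nodes/Locks table posted above.
assert [lock_count(n) for n in (2, 4, 6, 8, 16)] == [2, 3, 4, 4, 5]
# 40 active sessions with a pool of 10 connections per node => ~4 nodes.
assert estimated_nodes(40, 10) == 4
```

The growth is deliberately slow: doubling the cluster size adds only one lock slot, which keeps load on the catalog roughly constant as the cluster scales.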

WDYT?

Member

@steve-chavez steve-chavez Feb 20, 2026

> Hmm... but why? What harm is in allowing some level of concurrency? Especially that right now there is no limit at all?
> If the limit is so low (ie. 1) I am strongly against forcing it on users without a way to opt-out.

Right, so if only one lock can be taken we would be forcing all scache loads to be sequential. If we consider the case of 100 instances (#4642), then the first 99 have to be loaded before the 100th can even begin, right? And yes, that doesn't look good for the last instance, since it will prolong the time it has a stale schema cache.


> Node number estimation
> Number of locks calculation

Seems complicated. One simpler idea that occurs to me:

  1. Each postgREST instance has a corresponding LISTEN channel. So we know how many concurrent scache loads will happen.
  2. We can also know the time the latest scache query took, since we have the metric
    `schemaCacheQueryTime :: Gauge`,

Perhaps we can sample the first scache query time (2) and combined with 1 calculate the right number of locks?

Collaborator Author

@mkleczek mkleczek Feb 20, 2026

> > Hmm... but why? What harm is in allowing some level of concurrency? Especially that right now there is no limit at all?
> > If the limit is so low (ie. 1) I am strongly against forcing it on users without a way to opt-out.
>
> Right, so if only one lock can be taken we would be forcing all scache loads to be sequential. If we consider the case of 100 instances (#4642) then all first 99 have to be loaded before the 100th can even begin right? And yes that doesn't look good for the last instance since it will prolong the time it will have a stale schema cache.

> > Node number estimation
> > Number of locks calculation
>
> Seems complicated. One simpler idea that occurs to me:
>
> 1. Each postgREST instance has a corresponding LISTEN channel. So we know how many concurrent scache loads will happen.

I am afraid I don't understand. How do you want to count the number of Pgrst instances based on LISTEN channel?

> 2. We can also know the time the latest scache query took, since we have the metric `schemaCacheQueryTime :: Gauge`,
>
> Perhaps we can sample the first scache query time (2) and combined with 1 calculate the right number of locks?

OK, but what should be the formula to calculate the number of locks?

IMHO, if you already have the estimated number of nodes, calculating log2(number_of_nodes) is as simple as it gets - no additional information required.

The main reason why the proposal in this PR has merit, IMHO, is that all the required information is available locally to the node (i.e. it is only the number of its open connections). So acquiring the lock can be done with a single SELECT query taking a single parameter.
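To make the "single SELECT, single parameter" idea concrete, here is an in-memory simulation (Python, illustrative only; in the PR the real work happens inside SQL using PostgreSQL's `pg_try_advisory_xact_lock` / `pg_advisory_xact_lock`, and the helper below is hypothetical):

```python
import random

def acquire_slot(held: set, lock_ids: list, rng: random.Random) -> tuple:
    # Try each advisory lock non-blocking first, analogous to
    # pg_try_advisory_xact_lock($id); `held` stands in for locks
    # currently taken by other nodes' transactions.
    for lock_id in lock_ids:
        if lock_id not in held:
            held.add(lock_id)
            return ("acquired", lock_id)
    # All slots busy: block on one randomly chosen lock, analogous to
    # pg_advisory_xact_lock, so waiters spread evenly across slots.
    return ("waiting", rng.choice(lock_ids))

held = set()
slots = list(range(50168275, 50168275 + 4))
outcomes = [acquire_slot(held, slots, random.Random(0)) for _ in range(5)]
# Four loads proceed concurrently; the fifth must wait for a slot.
assert [status for status, _ in outcomes] == ["acquired"] * 4 + ["waiting"]
```

Because `pg_advisory_xact_lock` releases automatically at transaction end, a "slot" is held exactly as long as the schema cache query itself runs.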

Member

> Also I believe this should be a feature instead of a fix.

I think we're looking at two separate issues here:

  1. It's a bug that multiple PostgREST instances just end up as a thundering herd.
  2. There's a performance problem in reloading multiple PostgREST instances at the same time.

We should fix the bug by limiting to 1 concurrent schema cache reloader. We should then discuss how we can best improve performance. This discussion is conflating both.

Collaborator Author

> 1. It's a bug that multiple PostgREST instances just end up as a thundering herd.
> 2. There's a performance problem in reloading multiple PostgREST instances at the same time.
>
> We should fix the bug by limiting to 1 concurrent schema cache reloader.

Such a "fix" would introduce another (major) bug and hence is not acceptable IMO.

> We should then discuss how we can best improve performance. This discussion is conflating both.

Given the above, we must discuss both to come up with the right solution, I'm afraid.

Comment thread src/PostgREST/AppState.hs Outdated
withTxLock <- do
-- Allow 10 concurrent schema cache loads, guarded by advisory locks.
-- This is to prevent thundering herd problem on startup or when many PostgREST instances receive "reload schema" notifications at the same time
lockId <- getRandomR (50168275::Int64, 50168275 + 10)
Member

What's the reasoning behind this magic number?

Collaborator Author

> What's the reasoning behind this magic number?

It is just a randomly generated large number. We need something with a low probability of being used by anything other than PostgREST here (but it needs to be a hardcoded constant so that there is no risk of instances using different locks).

@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 11 times, most recently from c72cfeb to 8b8ef5c Compare February 20, 2026 16:56
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch from 8b8ef5c to f3e0692 Compare February 25, 2026 19:58
@mkleczek
Collaborator Author

@steve-chavez @wolfgangwalther rebased on top of #4672 to make review easier.

@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 3 times, most recently from 50f945f to 0d02360 Compare March 4, 2026 10:03
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 4 times, most recently from 2017326 to a2876fe Compare March 13, 2026 06:49
@steve-chavez
Member

Another option to work around this issue for now: #4642 (comment)

@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch from a2876fe to c073b79 Compare March 30, 2026 19:23
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 9 times, most recently from 16e0c0a to 4ae1ec1 Compare April 21, 2026 04:54
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 2 times, most recently from 5af6673 to d681d53 Compare April 21, 2026 17:28
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch from d681d53 to d6c21b0 Compare April 23, 2026 17:47
Comment thread src/PostgREST/AppState.hs
-- and waits for randomly selected lock if no attempt succeeded.
-- It has a single parameter: this node open connection count.
-- It is used to estimate the number of nodes
-- by counting the number of active sessions for current session_user
Member

So this assumes that all other instances are running with the same user as the current session?

That seems overly specific to a certain use-case for something to put into PostgREST.

Collaborator Author

@mkleczek mkleczek Apr 24, 2026

> So this assumes that all other instances are running with the same user as the current session?
>
> That seems overly specific to a certain use-case for something to put into PostgREST.

Is it really overly specific?

I cannot imagine hundreds (or even tens) of PostgREST instances connected to the same database, each configured differently, using a different user, and listening for the same schema changes... What would be the goal of doing that?
The thundering herd issue in reality happens only in clustered environments where identical PostgREST instances are started to horizontally scale the cluster.

This patch prevents the thundering herd issue in such cases.

Do you have any other cases in mind?

Also - do you have any ideas how to estimate the size of the PostgREST cluster without this assumption?

@steve-chavez WDYT?

Member

I can easily imagine a multi-tenant use-case, where each tenant gets their own authenticator role and their own PostgREST instance - but all connecting to the same database.

Collaborator Author

@mkleczek mkleczek Apr 24, 2026

> I can easily imagine a multi-tenant use-case, where each tenant gets their own authenticator role and their own PostgREST instance - but all connecting to the same database.

That would be schema-based multitenancy with a shared role structure, but for some reason the stateless PostgREST instances have to be separate while listening on the same channel. In other words, in such a setup one tenant running NOTIFY pgrst would force other tenants to reload their schema cache. Why would someone do that?
What's more, how many such PostgREST instances do you imagine in such a setup? So many that it would cause a thundering herd problem?

Member

Why would that be schema based multitenancy? I don't see it that way. They'd all be using the same schema, but need different JWT settings.

Member

Are we currently reloading the schema on a different connection than where we listen to notifications?

That seems like a bug in itself to me: If we listen on the primary, but load the schema from the replica, there is no guarantee, that the schema changes we actually want to load are already available on the replica. That means we might actually end up with a stale schema cache, even when the replica catches up afterwards.

I had assumed that not to be the case, was my assumption wrong?

Member

> Are we currently reloading the schema on a different connection than where we listen to notifications?

Yes, it's exactly as you mention, LISTEN on the primary and reload on the replica. I had the same concern on #4642 (comment).

Collaborator Author

@mkleczek mkleczek Apr 25, 2026

> Yes, it's exactly as you mention, LISTEN on the primary and reload on the replica. I had the same concern on #4642 (comment).

Loading from the master is no better, just in the opposite direction: an instance can have a newer schema cache than the actual schema on the replica.

There are no good solutions with current LISTEN/NOTIFY based architecture but it is unrelated to this PR - see also my comment here: #4642 (comment)

I've created #4842 to discuss it further.

Member

> I've created #4842 to discuss it further.

Since the conclusion in that issue is that "loading the schema cache from the primary is likely the best idea", I suggest we assume for now, that this happens / will be the case.

> it causes huge overestimation in case of single master and multiple replicas.

Since we need to start with something and need to make a compromise somewhere, I suggest we ignore this scenario, where we overestimate the schema reloads per replica. We can consider it part of the "schema cache reloads should really happen via the primary instead" issue.

So let's try to find a way to reliably count listener threads against the primary. As pointed out, I believe basing it off of the current role is unreliable. Steve pointed out that basing it off of the application_name is potentially unreliable as well.

Any other ideas?

Collaborator Author

@mkleczek mkleczek Apr 27, 2026

> > I've created #4842 to discuss it further.
>
> Since the conclusion in that issue is that "loading the schema cache from the primary is likely the best idea", I suggest we assume for now, that this happens / will be the case.

Fixing #4642 (possibly with this PR) must be a prerequisite to the above (otherwise the thundering herd vulnerability is amplified, as all PostgREST instances are going to hit the master).

> > it causes huge overestimation in case of single master and multiple replicas.
>
> Since we need to start with something and need to make a compromise somewhere, I suggest we ignore this scenario, where we overestimate the schema reloads per replica. We can consider it part of the "schema cache reloads should really happen via the primary instead" issue.

I think we should rather aim for the rule: throttling must be done in the same database transaction as schema cache loading (regardless of which connection is used for querying) - the thundering herd issue is really per database server instance.
This would mean merging this PR (as it implements this strategy) and, in a subsequent PR, changing the connection used for schema cache querying.

These issues are really orthogonal: one is about "what should be the source of schema cache data" and the other is about "making sure the source of schema cache data, whatever it is, is not overloaded by concurrent requests".

> So let's try to find a way to reliably count listener threads against the primary. As pointed out, I believe basing it off of the current role is unreliable. Steve pointed out that basing it off of the application_name is potentially unreliable as well.
>
> Any other ideas?

There is also another dimension here: it is safer to underestimate than to overestimate the number of PostgREST instances.
Basing the calculation off current_role can only underestimate instances (i.e. there can be other instances connected using a different user, but they won't be counted towards the limit).
Since all instances, regardless of session_user, use the same db locks for throttling, this fixes the thundering herd problem in all scenarios (it is just that the limit might be lower than optimal in the scenario @wolfgangwalther described).

So IMHO it is safe to merge this PR as is and we can change the strategy to be more liberal (ie. not underestimating) in the future.

@steve-chavez @wolfgangwalther WDYT?

@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch from d6c21b0 to 05c0745 Compare April 24, 2026 15:27
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 6 times, most recently from 00102b8 to 652db7c Compare April 28, 2026 12:26
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 3 times, most recently from 4ceb77e to 9a59f45 Compare May 3, 2026 11:56
mkleczek added 2 commits May 3, 2026 14:06
Right now the metrics observation handler does not track database connections but updates a single Gauge based on HasqlPoolObs events. This is problematic because the Hasql pool reports connection events in multiple phases. The connection state machine is not simple, and to precisely report the number of connections in various states it is necessary to track their lifecycles.

This change adds a ConnTrack data structure and logic to track database connection lifecycles. At the moment it supports precise "connected" and "inUse" connection counts. The "pgrst_db_pool_available" metric is implemented on top of ConnTrack instead of a simple Gauge.
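A minimal sketch of the lifecycle-tracking idea in the commit message (Python, illustrative only; the event names below are stand-ins, not the actual HasqlPoolObs constructors):

```python
from dataclasses import dataclass

@dataclass
class ConnTrack:
    """Instead of bumping a single gauge per pool event, track each
    connection's state transitions so the 'connected' and 'inUse'
    counts stay precise."""
    connected: int = 0
    in_use: int = 0

    # Illustrative lifecycle events (hypothetical names).
    def on_connect(self):    self.connected += 1
    def on_acquire(self):    self.in_use += 1
    def on_release(self):    self.in_use -= 1
    def on_disconnect(self): self.connected -= 1

    @property
    def available(self) -> int:
        # Basis for a metric like pgrst_db_pool_available.
        return self.connected - self.in_use

t = ConnTrack()
t.on_connect(); t.on_connect(); t.on_acquire()
assert (t.connected, t.in_use, t.available) == (2, 1, 1)
```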
Triggering a schema cache reload immediately upon receipt of a notification by the listener leads to a thundering herd problem in a PostgREST cluster.

This change limits the number of concurrent schema cache loading queries using advisory locks.
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch from 9a59f45 to c93950e Compare May 3, 2026 12:07


Development

Successfully merging this pull request may close these issues.

Thundering herd problem in PostgREST cluster on AWS ECS

3 participants