
fix: Limit concurrent schema cache loads#4643

Open
mkleczek wants to merge 2 commits into PostgREST:main from
mkleczek:use-advisory-locks-to-throttle-schema-cache-loading

Conversation

@mkleczek
Collaborator

@mkleczek mkleczek commented Feb 10, 2026

DISCLAIMER:
This commit was authored entirely by a human without the assistance of LLMs.

Triggering a schema cache reload immediately upon receipt of a notification by the listener leads to a thundering herd problem in a PostgREST cluster.

This change limits the number of concurrent schema cache loading queries using advisory locks.

Fixes #4642

@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 2 times, most recently from 92cf79e to b63457e Compare February 10, 2026 21:51
Comment thread src/PostgREST/AppState.hs Outdated
-- Allow 10 concurrent schema cache loads, guarded by advisory locks.
-- This is to prevent thundering herd problem on startup or when many PostgREST instances receive "reload schema" notifications at the same time
lockId <- getRandomR (50168275::Int64, 50168275 + 10)
let stmt = SQL.Statement "SELECT pg_catalog.pg_advisory_xact_lock($1)" (HE.param $ HE.nonNullable HE.int8) HD.noResult configDbPreparedStatements
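For illustration, the scheme in the quoted snippet can be sketched outside Haskell. This Python sketch mirrors the constants above (the helper names are hypothetical); note that getRandomR bounds are inclusive, so the range actually covers 11 distinct lock ids:

```python
import random

# Constants mirrored from the quoted Haskell snippet.
LOCK_ID_BASE = 50168275
LOCK_SLOTS = 10  # inclusive upper bound => 11 distinct ids in practice

def pick_lock_id(rng: random.Random) -> int:
    # Haskell's getRandomR uses inclusive bounds; randint matches that.
    return rng.randint(LOCK_ID_BASE, LOCK_ID_BASE + LOCK_SLOTS)

def advisory_lock_sql() -> str:
    # The statement the snippet prepares; pg_advisory_xact_lock blocks
    # until the lock is free and releases it automatically at tx end.
    return "SELECT pg_catalog.pg_advisory_xact_lock($1)"

# Every possible draw falls in the same fixed window, so at most 11
# schema cache loads can run concurrently across all instances.
assert all(
    LOCK_ID_BASE <= pick_lock_id(random.Random(s)) <= LOCK_ID_BASE + LOCK_SLOTS
    for s in range(100)
)
```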
Member

These locks would be released automatically at the end of the transaction right? It does look like it would work for #4642.

I guess one drawback is that these advisory locks would run and leave a log trace even if the user will never run into #4642, which is the case for most users.

WDYT of the solution on #4642 (comment)? Would that be preferable?

Member

> I guess one drawback

Also, it's a bit more operational overhead: we would also have to recommend setting lock_timeout in addition to statement_timeout to avoid waiting for too long (like on a schema cache load that takes too long due to pg_catalog bloat)?

Collaborator Author

> I guess one drawback
>
> Also that it's a bit more operational overhead, we would also have to recommend setting lock_timeout in addition to statement_timeout to avoid waiting for too long? (like on a schema cache load that takes too long due to pg_catalog bloat)

Is it really an issue? If we get lock timeout we are going to retry anyway.

Collaborator Author

@mkleczek mkleczek Feb 11, 2026

> These locks would be released automatically at the end of the transaction right? It does look like it would work for #4642.

Yeah, they are tx scoped.

> I guess one drawback is that these advisory locks would run and leave a log trace even if the user will never run into #4642, which are most cases.

Maybe we should introduce a config property that activates it then?

> WDYT of the solution on #4642 (comment)? Would that be preferable?

See #4642 (comment)

I think for now mitigation of the thundering herd problem is way more feasible. At least in the short to medium term.

Member

> I think for now mitigation of thundering herd problem is way more feasible. At least in the short to medium term.
>
> Maybe we should introduce a config property that activates it then?

Yes, agree... a config sounds good. Should we parametrize the number of locks? Or should we just hardcode it to 10 and expose a boolean config? (Also, how do we know 10 is the right number?)

We would also need to test this; it seems doable to ensure only 10 connections can exist at a time when, say, 20 PostgREST instances with db-pool=1 and a PGRST_INTERNAL_SCHEMA_CACHE_QUERY_SLEEP are spawned.

Collaborator Author

@mkleczek mkleczek Feb 20, 2026

> Agree, I think 1 is a good number. Given that I think we should avoid a config and just set it.

Hmm... but why? What harm is there in allowing some level of concurrency? Especially since right now there is no limit at all?

@steve-chavez @wolfgangwalther
I think we can actually do better.

How about we adjust the number of locks based on the (estimated) number of nodes connected to the same database?
There are two issues to solve:

  1. How do we estimate the number of cluster nodes?
  2. What should be the algorithm to calculate the number of locks?

**Node number estimation**

The idea is to estimate it based on:

  • the number of active db sessions opened by the same user as session_user
  • the number of open connections in the pool

The estimate would be: active_sessions_number / connections_in_the_pool

This assumes the load is spread evenly among cluster members, so all nodes should have the same number of open connections.

**Number of locks calculation**

We need a sublinear function, and it seems to me the logarithm is a good fit. The number of locks would be `round(log2(estimated_number_of_nodes))`.

That way we can allow concurrent schema loads while protecting against the thundering herd issue in large clusters.

I've committed an implementation of this idea for you to review. If you don't like it we can easily delete the commit. If you think it is OK, we can split it into coherent pieces.
I've added a test that verifies the level of concurrency for various cluster sizes, and the results are as follows:

| Nodes | Locks |
|-------|-------|
| 2     | 2     |
| 4     | 3     |
| 6     | 4     |
| 8     | 4     |
| 16    | 5     |
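For illustration, here is a quick numeric sketch of the proposal (Python, illustrative only; the PR itself is Haskell). Note that the prose above says `round(log2(n))` while the posted results are consistent with `round(log2(n)) + 1`, so the `+ 1` below is inferred from the table rather than stated in the thread:

```python
import math

def estimated_nodes(active_sessions: int, pool_connections: int) -> int:
    # active_sessions: db sessions open for the same session_user;
    # pool_connections: this node's open pool connections.
    # Assumes load is spread evenly, so every node holds the same count.
    return max(1, active_sessions // pool_connections)

def lock_count(nodes: int) -> int:
    # Sublinear growth via log2. The "+ 1" is inferred from the results
    # table in this thread (plain round(log2(2)) would give only 1 lock).
    return round(math.log2(nodes)) + 1

# Reproduces the Nodes/Locks table posted above.
assert [lock_count(n) for n in (2, 4, 6, 8, 16)] == [2, 3, 4, 4, 5]
# 40 active sessions with a pool of 10 connections per node => ~4 nodes.
assert estimated_nodes(40, 10) == 4
```

The growth is deliberately slow: doubling the cluster size adds only one lock slot, which keeps load on the catalog roughly constant as the cluster scales.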

WDYT?

Member

@steve-chavez steve-chavez Feb 20, 2026

> Hmm... but why? What harm is in allowing some level of concurrency? Especially that right now there is no limit at all?
> If the limit is so low (ie. 1) I am strongly against forcing it on users without a way to opt-out.

Right, so if only one lock can be taken we would be forcing all scache loads to be sequential. If we consider the case of 100 instances (#4642), then the first 99 have to be loaded before the 100th can even begin, right? And yes, that doesn't look good for the last instance, since it will prolong the time it has a stale schema cache.


> Node number estimation
> Number of locks calculation

Seems complicated. One simpler idea that occurs to me:

  1. Each postgREST instance has a corresponding LISTEN channel. So we know how many concurrent scache loads will happen.
  2. We can also know the time the latest scache query took, since we have the metric
    `schemaCacheQueryTime :: Gauge`,

Perhaps we can sample the first scache query time (2) and combined with 1 calculate the right number of locks?

Collaborator Author

@mkleczek mkleczek Feb 20, 2026

> > Hmm... but why? What harm is in allowing some level of concurrency? Especially that right now there is no limit at all?
> > If the limit is so low (ie. 1) I am strongly against forcing it on users without a way to opt-out.
>
> Right, so if only one lock can be taken we would be forcing all scache loads to be sequential. If we consider the case of 100 instances (#4642) then all first 99 have to be loaded before the 100th can even begin right? And yes that doesn't look good for the last instance since it will prolong the time it will have a stale schema cache.

> > Node number estimation
> > Number of locks calculation
>
> Seems complicated. One simpler idea that occurs to me:
>
> 1. Each postgREST instance has a corresponding LISTEN channel. So we know how many concurrent scache loads will happen.

I am afraid I don't understand. How do you want to count the number of Pgrst instances based on LISTEN channel?

> 2. We can also know the time the latest scache query took, since we have the metric `schemaCacheQueryTime :: Gauge`,
>
> Perhaps we can sample the first scache query time (2) and combined with 1 calculate the right number of locks?

OK, but what should be the formula to calculate the number of locks?

IMHO, if you already have the estimated number of nodes, calculating log2(number_of_nodes) is as simple as it gets - no additional information required.

The main reason why the proposal in this PR has merit, IMHO, is that all the required information is available locally to the node (i.e. it is only the number of its open connections). So acquiring the lock can be done with a single SELECT query taking a single parameter.
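To make the "single SELECT, single parameter" idea concrete, here is an in-memory simulation (Python, illustrative only; in the PR the real work happens inside SQL using PostgreSQL's `pg_try_advisory_xact_lock` / `pg_advisory_xact_lock`, and the helper below is hypothetical):

```python
import random

def acquire_slot(held: set, lock_ids: list, rng: random.Random) -> tuple:
    # Try each advisory lock non-blocking first, analogous to
    # pg_try_advisory_xact_lock($id); `held` stands in for locks
    # currently taken by other nodes' transactions.
    for lock_id in lock_ids:
        if lock_id not in held:
            held.add(lock_id)
            return ("acquired", lock_id)
    # All slots busy: block on one randomly chosen lock, analogous to
    # pg_advisory_xact_lock, so waiters spread evenly across slots.
    return ("waiting", rng.choice(lock_ids))

held = set()
slots = list(range(50168275, 50168275 + 4))
outcomes = [acquire_slot(held, slots, random.Random(0)) for _ in range(5)]
# Four loads proceed concurrently; the fifth must wait for a slot.
assert [status for status, _ in outcomes] == ["acquired"] * 4 + ["waiting"]
```

Because `pg_advisory_xact_lock` releases automatically at transaction end, a "slot" is held exactly as long as the schema cache query itself runs.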

Member

> Also I believe this should be a feature instead of a fix.

I think we're looking at two separate issues here:

  1. It's a bug that multiple PostgREST instances just end up as a thundering herd.
  2. There's a performance problem in reloading multiple PostgREST instances at the same time.

We should fix the bug by limiting to 1 concurrent schema cache reloader. We should then discuss how we can best improve performance. This discussion is conflating both.

Collaborator Author

> 1. It's a bug that multiple PostgREST instances just end up as a thundering herd.
> 2. There's a performance problem in reloading multiple PostgREST instances at the same time.
>
> We should fix the bug by limiting to 1 concurrent schema cache reloader.

Such a "fix" would introduce another (major) bug and hence is not acceptable IMO.

> We should then discuss how we can best improve performance. This discussion is conflating both.

Given the above, we must discuss both to come up with the right solution, I'm afraid.

Comment thread src/PostgREST/AppState.hs Outdated
withTxLock <- do
-- Allow 10 concurrent schema cache loads, guarded by advisory locks.
-- This is to prevent thundering herd problem on startup or when many PostgREST instances receive "reload schema" notifications at the same time
lockId <- getRandomR (50168275::Int64, 50168275 + 10)
Member

What's the reasoning behind this magic number?

Collaborator Author

> What's the reasoning behind this magic number?

It is just a randomly generated large number. We need something with a low probability of being used by anything other than PostgREST here (but it needs to be a hardcoded constant so that there is no risk of instances using different locks).

@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 11 times, most recently from c72cfeb to 8b8ef5c Compare February 20, 2026 16:56
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch from 8b8ef5c to f3e0692 Compare February 25, 2026 19:58
@mkleczek
Collaborator Author

@steve-chavez @wolfgangwalther rebased on top of #4672 to make review easier.

@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 3 times, most recently from 50f945f to 0d02360 Compare March 4, 2026 10:03
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 4 times, most recently from 2017326 to a2876fe Compare March 13, 2026 06:49
@steve-chavez
Member

Another option to work around this issue for now: #4642 (comment)

@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch from a2876fe to c073b79 Compare March 30, 2026 19:23
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 9 times, most recently from 16e0c0a to 4ae1ec1 Compare April 21, 2026 04:54
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 2 times, most recently from 5af6673 to d681d53 Compare April 21, 2026 17:28
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch from d681d53 to d6c21b0 Compare April 23, 2026 17:47
Comment thread src/PostgREST/AppState.hs
-- and waits for randomly selected lock if no attempt succeeded.
-- It has a single parameter: this node open connection count.
-- It is used to estimate the number of nodes
-- by counting the number of active sessions for current session_user
Member

So this assumes that all other instances are running with the same user as the current session?

That seems overly specific to a certain use-case for something to put into PostgREST.

Collaborator Author

@mkleczek mkleczek Apr 24, 2026

> So this assumes that all other instances are running with the same user as the current session?
>
> That seems overly specific to a certain use-case for something to put into PostgREST.

Is it really overly specific?

I cannot imagine hundreds (or even tens) of PostgREST instances connected to the same database, each configured differently, using a different user, and listening for the same schema changes... What would be the goal of doing that?
The thundering herd issue in reality happens only in clustered environments where identical PostgREST instances are started to horizontally scale the cluster.

This patch prevents the thundering herd issue in such cases.

Do you have any other cases in mind?

Also - do you have any ideas how to estimate the size of the PostgREST cluster without this assumption?

@steve-chavez WDYT?

Member

I can easily imagine a multi-tenant use-case, where each tenant gets their own authenticator role and their own PostgREST instance - but all connecting to the same database.

Collaborator Author

@mkleczek mkleczek Apr 24, 2026

> I can easily imagine a multi-tenant use-case, where each tenant gets their own authenticator role and their own PostgREST instance - but all connecting to the same database.

That would be schema-based multitenancy with a shared role structure, but for some reason the stateless PostgREST instances have to be separate while listening on the same channel. In other words, in such a setup one tenant running NOTIFY pgrst would force other tenants to reload their schema cache. Why would someone do that?
What's more, how many such PostgREST instances do you imagine in such a setup? So many that it would cause a thundering herd problem?

Member

Why would that be schema based multitenancy? I don't see it that way. They'd all be using the same schema, but need different JWT settings.

Member

Are we currently reloading the schema on a different connection than where we listen to notifications?

That seems like a bug in itself to me: If we listen on the primary, but load the schema from the replica, there is no guarantee, that the schema changes we actually want to load are already available on the replica. That means we might actually end up with a stale schema cache, even when the replica catches up afterwards.

I had assumed that not to be the case, was my assumption wrong?

Member

> Are we currently reloading the schema on a different connection than where we listen to notifications?

Yes, it's exactly as you mention, LISTEN on the primary and reload on the replica. I had the same concern on #4642 (comment).

Collaborator Author

@mkleczek mkleczek Apr 25, 2026

> Yes, it's exactly as you mention, LISTEN on the primary and reload on the replica. I had the same concern on #4642 (comment).

Loading from the master is no better, just in the opposite direction: an instance can have a newer schema cache than the actual schema on the replica.

There are no good solutions with current LISTEN/NOTIFY based architecture but it is unrelated to this PR - see also my comment here: #4642 (comment)

I've created #4842 to discuss it further.

Member

> I've created #4842 to discuss it further.

Since the conclusion in that issue is that "loading the schema cache from the primary is likely the best idea", I suggest we assume for now, that this happens / will be the case.

> it causes huge overestimation in case of single master and multiple replicas.

Since we need to start with something and need to make a compromise somewhere, I suggest we ignore this scenario, where we overestimate the schema reloads per replica. We can consider it part of the "schema cache reloads should really happen via the primary instead" issue.

So let's try to find a way to reliably count listener threads against the primary. As pointed out, I believe basing it off of the current role is unreliable. Steve pointed out that basing it off of the application_name is potentially unreliable as well.

Any other ideas?

Collaborator Author

@mkleczek mkleczek Apr 27, 2026

> > I've created #4842 to discuss it further.
>
> Since the conclusion in that issue is that "loading the schema cache from the primary is likely the best idea", I suggest we assume for now, that this happens / will be the case.

Fixing #4642 (possibly with this PR) must be a prerequisite to the above (otherwise the thundering herd vulnerability is amplified, as all PostgREST instances are going to hit the master).

> > it causes huge overestimation in case of single master and multiple replicas.
>
> Since we need to start with something and need to make a compromise somewhere, I suggest we ignore this scenario, where we overestimate the schema reloads per replica. We can consider it part of the "schema cache reloads should really happen via the primary instead" issue.

I think we should rather aim for the rule: throttling must be done in the same database transaction as schema cache loading (regardless of which connection is used for querying) - the thundering herd issue is really per database server instance.
This would mean merging this PR (as it implements this strategy) and, in a subsequent PR, changing the connection used for schema cache querying.

These issues are really orthogonal: one is about "what should be the source of schema cache data" and the other is about "making sure the source of schema cache data, whatever it is, is not overloaded by concurrent requests".

> So let's try to find a way to reliably count listener threads against the primary. As pointed out, I believe basing it off of the current role is unreliable. Steve pointed out that basing it off of the application_name is potentially unreliable as well.
>
> Any other ideas?

There is also another dimension here: it is safer to underestimate than to overestimate the number of PostgREST instances.
Basing the calculation off current_role can only underestimate instances (i.e. there can be other instances connected using a different user, but they won't be counted towards the limit).
Since all instances, regardless of session_user, use the same db locks for throttling, this fixes the thundering herd problem in all scenarios (it is just that the limit might be lower than optimal in the scenario @wolfgangwalther described).

So IMHO it is safe to merge this PR as is and we can change the strategy to be more liberal (ie. not underestimating) in the future.

@steve-chavez @wolfgangwalther WDYT?

@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch from d6c21b0 to 05c0745 Compare April 24, 2026 15:27
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 6 times, most recently from 00102b8 to 652db7c Compare April 28, 2026 12:26
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch 3 times, most recently from 4ceb77e to 9a59f45 Compare May 3, 2026 11:56
mkleczek added 2 commits May 3, 2026 14:06
Right now the metrics observation handler does not track database connections but updates a single Gauge based on HasqlPoolObs events. This is problematic because the Hasql pool reports connection events in multiple phases. The connection state machine is not simple, and to precisely report the number of connections in various states it is necessary to track their lifecycles.

This change adds a ConnTrack data structure and logic to track database connection lifecycles. At the moment it supports precise "connected" and "inUse" connection counts. The "pgrst_db_pool_available" metric is implemented on top of ConnTrack instead of a simple Gauge.
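A minimal sketch of the lifecycle-tracking idea in the commit message (Python, illustrative only; the event names below are stand-ins, not the actual HasqlPoolObs constructors):

```python
from dataclasses import dataclass

@dataclass
class ConnTrack:
    """Instead of bumping a single gauge per pool event, track each
    connection's state transitions so the 'connected' and 'inUse'
    counts stay precise."""
    connected: int = 0
    in_use: int = 0

    # Illustrative lifecycle events (hypothetical names).
    def on_connect(self):    self.connected += 1
    def on_acquire(self):    self.in_use += 1
    def on_release(self):    self.in_use -= 1
    def on_disconnect(self): self.connected -= 1

    @property
    def available(self) -> int:
        # Basis for a metric like pgrst_db_pool_available.
        return self.connected - self.in_use

t = ConnTrack()
t.on_connect(); t.on_connect(); t.on_acquire()
assert (t.connected, t.in_use, t.available) == (2, 1, 1)
```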
Triggering a schema cache reload immediately upon receipt of a notification by the listener leads to a thundering herd problem in a PostgREST cluster.

This change limits the number of concurrent schema cache loading queries using advisory locks.
@mkleczek mkleczek force-pushed the use-advisory-locks-to-throttle-schema-cache-loading branch from 9a59f45 to c93950e Compare May 3, 2026 12:07


Development

Successfully merging this pull request may close these issues.

Thundering herd problem in PostgREST cluster on AWS ECS

3 participants