
add: use SO_REUSEPORT on platforms supporting it #4703

Open
mkleczek wants to merge 2 commits into PostgREST:main from mkleczek:so-reuseport

Conversation

@mkleczek (Collaborator) commented Mar 10, 2026

DISCLAIMER:
This commit was authored entirely by a human without the assistance of LLMs.

Fixes #4694

Stacked on top of #4702: it is not enough to start a new instance, it is also necessary not to fail in-flight requests on the old instance.

@mkleczek mkleczek force-pushed the so-reuseport branch 3 times, most recently from c252511 to 27c16d7 Compare March 10, 2026 15:43
@steve-chavez (Member)

This is failing all tests; I'd suggest putting these types of PRs up as drafts.

@mkleczek (Collaborator, Author)

> This is failing all tests; I'd suggest putting these types of PRs up as drafts.

Hmm... it worked before the latest push. Will switch to draft and fix.

@mkleczek mkleczek marked this pull request as draft March 10, 2026 16:31
@mkleczek (Collaborator, Author)

> This is failing all tests; I'd suggest putting these types of PRs up as drafts.

> Hmm... it worked before the latest push. Will switch to draft and fix.

@steve-chavez - it looks like the issue is that on my machine PostgREST's startup is fast enough that it loads the schema cache before it accepts any requests. In CI the new instance fails with 503 because it has not yet loaded the schema cache.

The question: is there any particular reason why we return 503 instead of simply not listening on the socket until the schema cache is loaded? The way we have it right now means we can't really support zero-downtime upgrades, because between the moment the new instance starts and the moment it loads the schema cache, some clients will get 503.

@steve-chavez (Member) commented Mar 10, 2026

> The question: is there any particular reason why we return 503 instead of simply not listening on the socket until the schema cache is loaded?

We were aiming to have requests wait instead of returning 503; this waiting does happen during a schema cache reload, but not on startup. We discussed this in #4129. Would it be better not to listen on the socket? How would clients behave in that case?

@develop7 (Collaborator)

They would get a "connection refused" error, which means nobody is there and is, IMO, more confusing than any 5xx error. UX-wise I would prefer some waiting to a presumably hard fail any day.

@mkleczek (Collaborator, Author) commented Mar 11, 2026

> We were aiming to have requests wait instead of returning 503; this waiting does happen during a schema cache reload, but not on startup. We discussed this in #4129. Would it be better not to listen on the socket? How would clients behave in that case?

> They would get a "connection refused" error, which means nobody is there and is, IMO, more confusing than any 5xx error. UX-wise I would prefer some waiting to a presumably hard fail any day.

I am not convinced; see below.

This is a complex topic, so let's dig into it a little more. The startup sequence right now is:

  1. PostgREST is not running.
  2. Clients get "connection refused".
  3. PostgREST starts listening on the socket.
  4. Clients get 503.
  5. PostgREST loads the schema cache.
  6. Normal traffic.

The alternatives are:

a - blocking during schema cache loading:

  1. PostgREST is not running.
  2. Clients get "connection refused".
  3. PostgREST starts listening on the socket.
  4. Clients are blocked and potentially time out with some network error.
  5. PostgREST loads the schema cache.
  6. Normal traffic.

b - listening on the socket only after the schema cache is loaded:

  1. PostgREST is not running.
  2. Clients get "connection refused".
  3. PostgREST loads the schema cache and starts listening.
  4. Normal traffic.

So from the point of view of the clients (they don't know when PostgREST was started), we have three alternatives:

  1. connection refused -> 503 -> normal
  2. connection refused -> blocked/timeout -> normal
  3. connection refused -> normal

I am not sure what value clients get from the first two options compared with the third one. Diagnostics and readiness checks should be done via the admin server anyway.
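To make the client-visible states above concrete, here is a small Python sketch (illustrative only; PostgREST itself is written in Haskell) showing that a client connecting to a port nobody is listening on gets an immediate "connection refused" rather than a timeout or a 5xx response:

```python
import socket

def probe(port: int) -> str:
    """Return what a client observes when connecting to localhost:port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(1.0)
    try:
        s.connect(("127.0.0.1", port))
        return "connected"
    except ConnectionRefusedError:
        return "connection refused"
    finally:
        s.close()

# Reserve a free port, then close the socket so nothing is listening there.
tmp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tmp.bind(("127.0.0.1", 0))
port = tmp.getsockname()[1]
tmp.close()

print(probe(port))  # nothing listening -> "connection refused"
```

The failure is immediate and unambiguous, which is the behavior alternative 3 relies on: clients (or their load balancers) can retry, instead of holding a connection open against an instance that cannot serve them yet.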

In the case of SO_REUSEPORT the situation is even worse if we start listening early. Suppose instance 1 is running and instance 2 is started. From the point of view of clients:

  1. Today: normal traffic -> some clients get 503 (both instances are listening but only one is ready) -> normal traffic.
  2. Blocking: normal traffic -> some clients are blocked or time out (both instances are listening, one is blocking) -> normal traffic.
  3. Not listening: just normal traffic (there is no disruption at all, because all requests are handled by instance 1 until instance 2 is ready and starts listening).

So the first two options cause disruptions, whereas the third one is fully zero-downtime and transparent to the clients.
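The disruption-free handover relies on an OS-level property: with SO_REUSEPORT, two processes can bind and listen on the same port at the same time. A minimal Python sketch of that behavior (illustrative; the "instance" naming is made up, and SO_REUSEPORT must be supported by the platform):

```python
import socket

def make_listener(port: int) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEPORT lets several sockets bind and listen on the same port.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(16)
    return s

old = make_listener(0)        # "instance 1"; the kernel picks a free port
port = old.getsockname()[1]
new = make_listener(port)     # "instance 2": same port, no EADDRINUSE

# Clients can connect throughout; once instance 1 goes away,
# instance 2 keeps serving the same port.
c1 = socket.create_connection(("127.0.0.1", port), timeout=1.0)
old.close()
c2 = socket.create_connection(("127.0.0.1", port), timeout=1.0)
for s in (c1, c2, new):
    s.close()
```

One caveat: connections still queued on the old socket when it closes are dropped, which is exactly why this PR is stacked on #4702 (not failing in-flight requests on the old instance).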

My take on it would be:

  1. Start the admin server as early as possible.
  2. Load the schema cache.
  3. Start listening on the main socket.

This would require splitting binding from listening on the main socket (i.e. we need to bind without listening first, so that we can pass the socket to the admin server).
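The bind/listen split can be sketched at the socket level (Python for illustration; in PostgREST this would go through Warp and Network.Socket, and the comments mark where the schema cache load would happen):

```python
import socket

def probe(port: int) -> str:
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.settimeout(1.0)
    try:
        c.connect(("127.0.0.1", port))
        return "connected"
    except ConnectionRefusedError:
        return "connection refused"
    finally:
        c.close()

# Bind early: the admin server can already know about the socket...
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
port = srv.getsockname()[1]

before = probe(port)  # bound but not listening: clients get "connection refused"

# ... load the schema cache here ...

srv.listen(16)        # ...and only listen once ready to serve
after = probe(port)   # now "connected"
srv.close()
```

A bound-but-not-listening port behaves exactly like a closed one from the client's perspective, so clients never see a half-ready server.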

@steve-chavez @develop7 thoughts?

@mkleczek (Collaborator, Author)

Dependent on resolving (or finding a workaround for) yesodweb/wai#853.

@mkleczek mkleczek force-pushed the so-reuseport branch 2 times, most recently from 64bb6ca to 2cb6503 Compare March 11, 2026 18:49
@steve-chavez (Member)

> So the first two options cause disruptions, whereas the third one is fully zero-downtime and transparent to the clients.
> My take on it would be:

Agree, sounds much better.

@mkleczek mkleczek force-pushed the so-reuseport branch 2 times, most recently from 9ebfb7f to 3f522cc Compare April 28, 2026 10:32
Comment thread: CHANGELOG.md (outdated)

- Fix login with uppercase and mixed case role names by @taimoorzaeem in #4678
- Remove automatic transaction retries on `40001 (serialization_failure)` errors to prevent replication lag by @laurenceisla in #3673
- Enable starting multiple PostgREST instances using the same ports on platforms supporting it by @mkleczek in #4703 #4694
(Member)

Oh, interesting, is that true? Can I start multiple instances and run them in parallel? I thought this was only possible in some kind of seamless-transition way, starting one while stopping the other, but this reads as if I could run them in parallel indefinitely?

(Collaborator, Author)

Yes, that's correct. SO_REUSEPORT is a poor man's load balancer.
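The "poor man's load balancer" point can be demonstrated directly: with SO_REUSEPORT the kernel hashes each incoming connection to one of the sockets bound to the port. A Python sketch (illustrative only; the split between the two listeners varies per run and per platform):

```python
import selectors
import socket

def listener(port: int) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(64)
    return s

a = listener(0)
port = a.getsockname()[1]
b = listener(port)

# 40 client connections; the kernel assigns each one to a listener.
clients = [socket.create_connection(("127.0.0.1", port), timeout=1.0)
           for _ in range(40)]

# Drain both accept queues and count how the connections were spread.
sel = selectors.DefaultSelector()
sel.register(a, selectors.EVENT_READ, "a")
sel.register(b, selectors.EVENT_READ, "b")
counts = {"a": 0, "b": 0}
accepted = []
while sum(counts.values()) < len(clients):
    for key, _ in sel.select(timeout=2.0):
        conn, _addr = key.fileobj.accept()
        accepted.append(conn)
        counts[key.data] += 1

print(counts)  # e.g. {'a': 23, 'b': 17}; the exact split varies per run

for s in clients + accepted + [a, b]:
    s.close()
```

This is also why the "accidentally started two instances" confusion mentioned below is real: requests genuinely get served by different processes.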

(Member)

I guess we have to keep that in mind for odd test failures or bug reports. Where a user would previously get a "port already in use" error when they accidentally started multiple instances of PostgREST, they will now get requests randomly served from different instances. This might be very confusing...

(Collaborator, Author)

Yeah... We might guard it with a config flag that's off by default. But I'm not too keen on adding yet another config property.

Open to suggestions, though.

(Member)

> I'm not too keen on adding yet another config property.

Same!

I don't have any better suggestions right now; I think reusing the port is sane behavior for most actual deployments.

(Member)

Yes, I think so. If we ever wanted to reuse the admin server port, we should add admin-server-reuseport as a separate config option; server-xxx should not affect the admin server.

(Member)

@mkleczek Seeing the docs you added here ("Admin ports are not shared between processes. This keeps readiness checks unambiguous"), would it ever make sense to have reuseport for the admin server? Wouldn't that invalidate zero-downtime upgrades?

(Collaborator, Author)

> @mkleczek Seeing the docs you added here ("Admin ports are not shared between processes. This keeps readiness checks unambiguous"), would it ever make sense to have reuseport for the admin server?

IMHO - no.
This is a very common problem in clustered environments: it does not make sense to load balance metrics and health endpoints. SO_REUSEPORT is a form of load balancing, and the same rule applies.

> Wouldn't that invalidate zero-downtime upgrades?

I am not sure I understand.
You can have zero-downtime upgrades without the admin server running at all. Just start the new instance, wait for it to log "schema cache loaded" (or simply wait long enough; e.g. 1 minute should be safe), then stop the first instance.
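The procedure described above could be sketched as a shell script. This is an illustrative sketch only: the binary location, config path, log file, PID file, and exact log message are assumptions rather than PostgREST defaults; only the overall sequence (start the new instance, wait for its schema cache, stop the old one) comes from the comment above.

```shell
#!/bin/sh
set -eu

# Assumed locations -- placeholders, not PostgREST defaults.
OLD_PID=$(cat /run/postgrest.pid)
NEW_LOG=/var/log/postgrest-new.log

# 1. Start the new instance; SO_REUSEPORT lets it share the old instance's port.
/opt/postgrest/bin/postgrest /etc/postgrest.conf >"$NEW_LOG" 2>&1 &
NEW_PID=$!

# 2. Wait until the new instance reports its schema cache is loaded
#    (exact log message assumed; adjust to the actual output).
until grep -q "schema cache loaded" "$NEW_LOG"; do
  sleep 1
done

# 3. Stop the old instance; with #4702, its in-flight requests finish first.
kill "$OLD_PID"
echo "$NEW_PID" >/run/postgrest.pid
```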

@steve-chavez (Member) commented May 4, 2026

Unresolving this conversation since I see the server-reuseport config hasn't been added yet. As agreed, it should only affect the API server, not the admin server.

(Collaborator, Author)

Implemented the server-reuseport configuration property.
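For illustration, the new option might appear in a config file like this. This is a sketch: the surrounding option names follow PostgREST's standard configuration format but carry made-up values, and only ``server-reuseport`` is the property added by this PR (the off-by-default behavior reflects the discussion above, not a verified default):

```
db-uri = "postgres://authenticator@localhost:5432/app"
server-host = "127.0.0.1"
server-port = 3000

# Added by this PR: share the port between instances on platforms
# that support SO_REUSEPORT (assumed off by default).
server-reuseport = true
```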

@mkleczek mkleczek marked this pull request as ready for review April 28, 2026 11:13
@mkleczek mkleczek force-pushed the so-reuseport branch 2 times, most recently from a420618 to f61afb0 Compare April 28, 2026 18:26
Comment thread: docs/how-tos/zero-downtime-upgrades.rst (outdated)
@mkleczek mkleczek force-pushed the so-reuseport branch 2 times, most recently from 2b8643e to 6c64080 Compare April 29, 2026 09:04
Comment thread: docs/how-tos/zero-downtime-upgrades.rst (outdated)
Comment thread: test/io/test_io.py (outdated)
@mkleczek mkleczek force-pushed the so-reuseport branch 3 times, most recently from 033bd60 to 127490e Compare April 29, 2026 09:59
Comment thread: docs/how-tos/zero-downtime-upgrades.rst
@wolfgangwalther (Member)

Conflicted in the changelog.

Comment thread: src/PostgREST/App.hs
Comment on lines +85 to +103:

    mainSocketRef <- newIORef Nothing
    adminSocket <- initAdminServerSocket conf

    let closeSockets = do
          whenJust adminSocket NS.close
          readIORef mainSocketRef >>= foldMap NS.close
    Unix.installSignalHandlers observer closeSockets (AppState.schemaCacheLoader appState) (AppState.readInDbConfig False appState)

    Admin.runAdmin appState adminSocket (readIORef mainSocketRef) (serverSettings conf)

    Listener.runListener appState

    -- Kick off and wait for the initial SchemaCache load before creating the
    -- main API socket.
    AppState.schemaCacheLoader appState
    AppState.waitForSchemaCacheLoaded appState

    mainSocket <- initServerSocket conf
    atomicWriteIORef mainSocketRef $ Just mainSocket
(Member)

@mkleczek Was the startup sequence change you mentioned in #4703 (comment) done here?

(Collaborator, Author)

Yes.

@steve-chavez (Member) commented May 4, 2026

@mkleczek I think we should do that one in another PR, since we agreed it could be independent from this one (ref). That would also give us a chance to fix #4129 with a test and discuss further if needed.

(Collaborator, Author)

> @mkleczek I think we should do that one in another PR, since we agreed it could be independent from this one (ref). That would also give us a chance to fix #4129 with a test and discuss further if needed.

I stacked this one on top of #4880.


Zero-Downtime Upgrades
======================

(Member)

Suggested change:

    :author: `mkleczek <https://github.com/mkleczek>`_

We've been doing this for almost all how-tos.
Comment thread: docs/references/configuration.rst (outdated)

The TCP port to bind the web server. Use ``0`` to automatically assign a port.

On operating systems that support ``SO_REUSEPORT``, you can start multiple

(Member)

Let's put a heading and an anchor here so we can link to it from other places:

Suggested change:

    .. _reuseport:

    SO_REUSEPORT
    ~~~~~~~~~~~~

    On operating systems that support ``SO_REUSEPORT``, you can start multiple

@mkleczek mkleczek force-pushed the so-reuseport branch 2 times, most recently from 0e7d4aa to 680c6e1 Compare May 4, 2026 17:53
@mkleczek mkleczek force-pushed the so-reuseport branch 2 times, most recently from c12c6fd to bec258c Compare May 5, 2026 09:27
mkleczek added 2 commits May 5, 2026 20:51

This change ensures PostgREST starts listening on the server socket only after it has loaded the schema cache and is ready to handle requests. It no longer returns 503 errors during startup while the schema cache is being loaded.

Labels: none yet

Development: successfully merging this pull request may close these issues:

- Start Warp with SO_REUSEPORT on supporting platforms

4 participants