
Implement scalable remote control-plane coordination #374

Merged
bill-ph merged 41 commits into main from scalable-cp on Mar 27, 2026

Conversation

@bill-ph
Collaborator

@bill-ph bill-ph commented Mar 27, 2026

Summary

  • add durable runtime coordination for scalable remote control planes, including control-plane instance tracking, worker records, and ownership fencing
  • make shared warm-pool maintenance leader-only and cluster-wide, with janitor cleanup for stale replicas, orphaned workers, stuck workers, and overdue draining replicas
  • harden Flight reconnect and worker control semantics by routing reconnects through the durable org binding and fencing worker operations with durable owner metadata

Behavior

  • multiple control-plane replicas can now coordinate worker claims and background cleanup through the runtime config-store tables
  • pgwire connections still drop on control-plane failure; Flight reconnect may recover when the worker survives and the durable token remains valid
  • neutral warm workers are shared capacity, but tenant-activated workers remain disposable and are retired after use

Testing

  • just test-unit
  • just test-controlplane-k8s
  • just test-configstore-integration
  • just lint

More Details

Design
The current design is a horizontally scalable control plane with shared coordination in Postgres, but with a strict isolation boundary for tenant-touched workers.

Core model:

  • Multiple control-plane replicas may serve traffic.
  • Runtime coordination lives in the config-store Postgres DB, not in one replica’s memory.
  • Shared warm workers start as neutral idle capacity.
  • A control-plane replica claims an idle worker transactionally, which creates a temporary lease:
    • set org assignment
    • set owner_cp_instance_id
    • bump owner_epoch
  • owner_epoch is the fencing token. It prevents stale replicas from continuing to control a worker after ownership changes.
  • Worker RPCs validate full ownership metadata, not just epoch:
    • owner_epoch
    • cp_instance_id
    • worker_id
  • Once a worker has been activated for a tenant, it is disposable.
  • After its last session ends, it is retired, not returned to the shared pool.
  • New neutral warm capacity comes from spawning fresh workers, not cross-tenant recycling.
  • One elected janitor replica runs background reconciliation:
    • expire stale control-plane instances
    • retire orphaned/stuck workers
    • maintain warm capacity
    • prune expired Flight reconnect state
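The worker-side fencing check described above can be sketched in memory. This is an illustrative sketch only; the type and method names (`OwnerMeta`, `ValidateOwner`) are assumptions, not the PR's actual API:

```go
package main

import "fmt"

// OwnerMeta is an illustrative stand-in for the durable ownership
// metadata a worker control RPC carries (names are hypothetical).
type OwnerMeta struct {
	WorkerID     string
	CPInstanceID string
	OwnerEpoch   int64
}

// Worker holds the worker's current view of its ownership.
type Worker struct {
	meta OwnerMeta
}

// ValidateOwner fences an incoming RPC: the worker ID must match,
// a lower epoch is always rejected as stale, and an equal epoch is
// only honored for the instance that owns it.
func (w *Worker) ValidateOwner(req OwnerMeta) error {
	if req.WorkerID != w.meta.WorkerID {
		return fmt.Errorf("wrong worker: got %q", req.WorkerID)
	}
	if req.OwnerEpoch < w.meta.OwnerEpoch {
		return fmt.Errorf("stale epoch %d < %d", req.OwnerEpoch, w.meta.OwnerEpoch)
	}
	if req.OwnerEpoch == w.meta.OwnerEpoch && req.CPInstanceID != w.meta.CPInstanceID {
		return fmt.Errorf("epoch %d owned by %q, not %q",
			req.OwnerEpoch, w.meta.CPInstanceID, req.CPInstanceID)
	}
	// A higher epoch means ownership legitimately moved; adopt it.
	w.meta = req
	return nil
}

func main() {
	w := &Worker{meta: OwnerMeta{WorkerID: "wk-1", CPInstanceID: "cp-a", OwnerEpoch: 3}}
	fmt.Println(w.ValidateOwner(OwnerMeta{"wk-1", "cp-a", 3}) == nil) // current owner: accepted
	fmt.Println(w.ValidateOwner(OwnerMeta{"wk-1", "cp-b", 2}) == nil) // stale epoch: rejected
	fmt.Println(w.ValidateOwner(OwnerMeta{"wk-1", "cp-b", 4}) == nil) // takeover, newer epoch: accepted
}
```

The key property is that a replica holding a stale `owner_epoch` can never affect the worker again, no matter how out of date its local state is.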

Important consequence:

  • Idle warm workers are shared pool resources.
  • Leased workers are temporarily owned.
  • Tenant-touched workers are never reused across tenants.

Common Scenarios

  1. Shared warm pool steady state
  • The janitor keeps a configured target of neutral idle workers available.
  • These workers have no meaningful live tenant binding.
  • They exist only to reduce cold-start latency.
  2. New pgwire session
  • A client connects to any control-plane replica.
  • The replica tries to claim an idle worker from the shared pool using a Postgres transaction.
  • If it wins a claim, it sets ownership metadata and reserves that worker for the target org.
  • The worker is activated for that tenant runtime.
  • The control plane creates a worker-local session and proxies the client.
  • When the last session ends, that worker is retired.
  • Replacement warm capacity is replenished separately by spawning another neutral worker.
  3. New Flight session
  • Same worker-claim path as pgwire.
  • The control plane also writes a durable Flight reconnect record.
  • That record binds the session token to:
    • org
    • worker
    • owner metadata
    • TTL
  • If the serving control plane stays healthy, the session behaves normally.
  4. Two control planes race for the same idle worker
  • Both try to claim an idle row in Postgres.
  • FOR UPDATE SKIP LOCKED / transactional update lets only one claim succeed.
  • The winner bumps owner_epoch and becomes the lease holder.
  • Even if the loser had stale local knowledge, worker-side fencing rejects stale RPCs later.
  5. Control-plane crash during pgwire
  • The client connection dies.
  • Existing pgwire session is lost.
  • The worker may still be alive for a short time, but that does not preserve the pgwire session.
  • Janitor or takeover logic retires the worker (or, in the Flight case, reconciles it for takeover).
  • Client reconnects through another control-plane replica.
  6. Control-plane crash during Flight
  • The gRPC stream dies.
  • The client reconnects with the Flight token before TTL expiry.
  • The new control-plane replica looks up the durable reconnect record.
  • If the worker still exists and takeover is safe, it can rebind to the same worker by taking ownership with a newer epoch.
  • This is reconnect-and-rebind, not transparent continuation of the old stream.
  • If takeover is not possible, the reconnect fails and the client must start a new session.
  7. Planned rollout
  • A replica enters draining mode.
  • It stops taking new work.
  • Existing sessions are allowed to continue up to the configured drain timeout.
  • Because this is planned, active sessions should usually survive by remaining pinned to their current control plane until completion.
  • After timeout or full drain, the old replica exits.
  • New replicas take new traffic.
  • This is different from crash failover, where connections are lost.
  8. Orphaned worker cleanup
  • If a control-plane replica disappears without cleaning up, some workers may still be running.
  • The janitor notices the owner replica’s heartbeat expired.
  • It marks those workers orphaned/stale and retires them safely.
  • This prevents leaked capacity and split-brain control.
  9. Org worker cap enforcement
  • Org caps are enforced both when spawning new workers and when claiming warm idle workers.
  • That prevents multiple replicas from exceeding the per-org worker limit just by racing on shared warm capacity.

Operational Semantics

  • Horizontal scalability means multiple control planes can accept new traffic.
  • It does not mean existing pgwire connections survive control-plane failure.
  • Isolation is favored over aggressive worker reuse.
  • Shared pool reuse is only for never-activated neutral workers.
  • Tenant-activated workers are single-tenant and disposable after use.

@bill-ph changed the title from "Add scalable multitenant control plane routing and activation" to "Implement scalable remote control-plane coordination" on Mar 27, 2026
@EDsCODE
Contributor

EDsCODE commented Mar 27, 2026

PR #374 Review: Scalable Remote Control-Plane Coordination

+5468 / -419 across 55 files, 33 commits

This is a substantial architectural change that moves from single-process control-plane coordination to horizontally scalable, database-backed runtime coordination. Overall the design is solid — epoch-based fencing, advisory locks for capacity gates, durable Flight reconnect, and leader-elected janitor cleanup are all well thought out. Here are my findings:


High Severity

1. TakeOverWorker returns nil on epoch mismatch — indistinguishable from "no worker found"
controlplane/configstore/store.go — when the epoch check fails, the function returns nil, nil (no error). Callers like claimSpecificWorker in k8s_pool.go can't distinguish "stale epoch" from "worker doesn't exist", which could cause infinite retry loops on a permanently-stale epoch. Consider returning a sentinel error (e.g., ErrEpochMismatch) so callers can break out.

2. Concurrent Flight reconnect race can orphan worker sessions
server/flightsqlingress/ingress.go:~1748 — if a client sends two reconnect requests simultaneously, both can pass the durable store lookup, both call ReconnectSession, both get new pids, and the second overwrites the first in s.sessions[token]. The first pid's worker session becomes orphaned. A sync.Mutex or compare-and-swap on the token during reconnect registration would fix this.
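One possible shape for the suggested guard: a mutex-protected per-token registration so only one reconnect for a token is in flight at a time. The `reconnectGuard` type and its methods are hypothetical, not the ingress code:

```go
package main

import (
	"fmt"
	"sync"
)

// reconnectGuard ensures only one in-flight reconnect per token; the
// losing request bails out instead of overwriting the winner's pid.
type reconnectGuard struct {
	mu       sync.Mutex
	inFlight map[string]bool
}

// begin atomically checks and marks the token; it returns false when
// another reconnect already owns it.
func (g *reconnectGuard) begin(token string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.inFlight[token] {
		return false
	}
	g.inFlight[token] = true
	return true
}

// end releases the token once the reconnect has fully registered
// (or failed), allowing a later legitimate reconnect.
func (g *reconnectGuard) end(token string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.inFlight, token)
}

func main() {
	g := &reconnectGuard{inFlight: map[string]bool{}}
	fmt.Println(g.begin("tok-1")) // first reconnect proceeds
	fmt.Println(g.begin("tok-1")) // concurrent duplicate is rejected
	g.end("tok-1")
	fmt.Println(g.begin("tok-1")) // after completion, reconnect is allowed again
}
```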

3. Janitor early-return on first error skips remaining cleanup
controlplane/janitor.go:~80 — runOnce() returns on the first error from ExpireControlPlaneInstances. If that fails (e.g., transient DB error), orphaned worker cleanup, stuck worker cleanup, warm capacity reconciliation, and Flight session expiry all get skipped for that cycle. Consider logging and continuing rather than returning early, or at minimum running all cleanup steps independently.


Medium Severity

4. No liveness recheck on the takeover path
controlplane/k8s_pool.go — reserveClaimedWorker does a liveness recheck for warm-pool claims (~line 1035) but claimSpecificWorker for takeover (~line 1199) skips it. A dead pod can be returned to the reservation flow.

5. Reconnect failure doesn't update durable record state
server/flightsqlingress/ingress.go:~1735 — when ReconnectSession fails, the error is logged but the durable record stays active. The client will retry immediately with no backoff, hammering the reconnect path. Consider marking the record as expired/failed, or adding a retry-after hint.

6. Lease expiry not checked during query execution
controlplane/org_reserved_pool.go:~269 — worker lease is set on claim (default 24h) but never refreshed during long-running queries. If a query exceeds the lease, the worker could theoretically be reassigned mid-execution. Low probability with 24h default, but worth a comment or a lease-refresh mechanism for safety.

7. Find vs Get inconsistency in store
controlplane/configstore/store.go — GetFlightSessionRecord() returns an error on not-found, but FindFlightSessionRecord() returns (nil, nil). This inconsistency will confuse callers. Pick a convention (Go standard: Get errors, Find returns nil) and document it.


Low Severity

8. Drain blocks indefinitely on long-running queries
server/flightsqlingress/ingress.go:~1664 — drain waits for all sessions to become idle. A session with a long-running query blocks drain until HandoverDrainTimeout (15m), at which point sessions are force-closed without updating durable records. Orphaned queries on the worker continue executing. Probably acceptable operationally but worth documenting.

9. No draining check on reconnect path
server/flightsqlingress/ingress.go — during drain, new pgwire connections are rejected but Flight reconnects can still succeed. The reconnected session will be unable to do meaningful work since the CP is shutting down. Rejecting reconnects during drain would give clients a cleaner signal to connect to a new replica.

10. Timing-sensitive tests
controlplane/runtime_tracker_test.go and controlplane/janitor_test.go use time.Sleep with narrow windows (10-25ms intervals, 250ms deadlines). These will be flaky under CI load. Consider channel-based synchronization or larger margins.

11. Resource limits may be tight
k8s/control-plane-deployment.yaml — requests: 100m/128Mi, limits: 500m/256Mi. May be fine for a thin proxy but could be tight with leader election, janitor loops, and Flight session management under load. Worth monitoring.


Positives

  • Epoch fencing is thorough: Three-layer defense (database CAS, in-memory state, per-RPC validation) consistently applied across all worker control RPCs
  • SQL injection safe: All queries use parameterized inputs; schema names go through quoteIdentifier()
  • Advisory locks for capacity gating: Clean pattern for preventing over-provisioning across concurrent replicas
  • Reconnect uses OrgID from durable record, not current user: Prevents session hijacking across org boundaries
  • CI improvements are solid: ECR mirror, SHA-pinned kind images, retry logic all improve reliability
  • Manifest test (k8s_manifest_test.go) catching readiness probe regressions at build time is a nice touch
  • Runbooks are clear, accurate, and operationally useful
  • Label-based ownership is the right move over OwnerReferences for graceful drain support

Bottom Line

The core design is sound and the fencing mechanism is well-implemented where it matters most (the RPC boundary). The high-severity items (#1, #2, #3) are worth fixing before merge — they're all straightforward. The rest are improvements that could be addressed in follow-ups.

@bill-ph
Collaborator Author

bill-ph commented Mar 27, 2026


Responses to the review findings above:

  1. It just fails; there is no retry. Agreed that the error message could be more accurate; fixed.
  2. Epoch fencing will reject one of the requests.
  3. Fixed.
  4. Fixed.
  5. Fixed.
  6. The current implementation retires workers when their session terminates, so lease expiry is redundant; removed.
  7. Cleaned up.
  8. Intended behavior: we want seamless deployment without terminating in-flight queries.
  9. Same as 8.
  10. Switched sleeps to more robust synchronization.
  11. The YAML is obsolete; removed.

@bill-ph bill-ph merged commit df1a3ce into main Mar 27, 2026
20 checks passed
@bill-ph bill-ph deleted the scalable-cp branch March 27, 2026 19:38