
Implement scalable remote control-plane coordination #374

Merged
bill-ph merged 41 commits into main from scalable-cp on Mar 27, 2026

Conversation

@bill-ph
Collaborator

@bill-ph bill-ph commented Mar 27, 2026

Summary

  • add durable runtime coordination for scalable remote control planes, including control-plane instance tracking, worker records, and ownership fencing
  • make shared warm-pool maintenance leader-only and cluster-wide, with janitor cleanup for stale replicas, orphaned workers, stuck workers, and overdue draining replicas
  • harden Flight reconnect and worker control semantics by routing reconnects through the durable org binding and fencing worker operations with durable owner metadata

Behavior

  • multiple control-plane replicas can now coordinate worker claims and background cleanup through the runtime config-store tables
  • pgwire connections still drop on control-plane failure; Flight reconnect may recover when the worker survives and the durable token remains valid
  • neutral warm workers are shared capacity, but tenant-activated workers remain disposable and are retired after use

Testing

  • just test-unit
  • just test-controlplane-k8s
  • just test-configstore-integration
  • just lint

More Details

Design
The current design is a horizontally scalable control plane with shared coordination in Postgres, but with a strict isolation boundary for tenant-touched workers.

Core model:

  • Multiple control-plane replicas may serve traffic.
  • Runtime coordination lives in the config-store Postgres DB, not in one replica’s memory.
  • Shared warm workers start as neutral idle capacity.
  • A control-plane replica claims an idle worker transactionally, which creates a temporary lease:
    • set org assignment
    • set owner_cp_instance_id
    • bump owner_epoch
  • owner_epoch is the fencing token. It prevents stale replicas from continuing to control a worker after ownership changes.
  • Worker RPCs validate full ownership metadata, not just epoch:
    • owner_epoch
    • cp_instance_id
    • worker_id
  • Once a worker has been activated for a tenant, it is disposable.
  • After its last session ends, it is retired, not returned to the shared pool.
  • New neutral warm capacity comes from spawning fresh workers, not cross-tenant recycling.
  • One elected janitor replica runs background reconciliation:
    • expire stale control-plane instances
    • retire orphaned/stuck workers
    • maintain warm capacity
    • prune expired Flight reconnect state
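The worker-side fencing check described above can be sketched in memory. This is an illustrative sketch only; the type and method names (`OwnerMeta`, `ValidateOwner`) are assumptions, not the PR's actual API:

```go
package main

import "fmt"

// OwnerMeta is an illustrative stand-in for the durable ownership
// metadata a worker control RPC carries (names are hypothetical).
type OwnerMeta struct {
	WorkerID     string
	CPInstanceID string
	OwnerEpoch   int64
}

// Worker holds the worker's current view of its ownership.
type Worker struct {
	meta OwnerMeta
}

// ValidateOwner fences an incoming RPC: the worker ID must match,
// a lower epoch is always rejected as stale, and an equal epoch is
// only honored for the instance that owns it.
func (w *Worker) ValidateOwner(req OwnerMeta) error {
	if req.WorkerID != w.meta.WorkerID {
		return fmt.Errorf("wrong worker: got %q", req.WorkerID)
	}
	if req.OwnerEpoch < w.meta.OwnerEpoch {
		return fmt.Errorf("stale epoch %d < %d", req.OwnerEpoch, w.meta.OwnerEpoch)
	}
	if req.OwnerEpoch == w.meta.OwnerEpoch && req.CPInstanceID != w.meta.CPInstanceID {
		return fmt.Errorf("epoch %d owned by %q, not %q",
			req.OwnerEpoch, w.meta.CPInstanceID, req.CPInstanceID)
	}
	// A higher epoch means ownership legitimately moved; adopt it.
	w.meta = req
	return nil
}

func main() {
	w := &Worker{meta: OwnerMeta{WorkerID: "wk-1", CPInstanceID: "cp-a", OwnerEpoch: 3}}
	fmt.Println(w.ValidateOwner(OwnerMeta{"wk-1", "cp-a", 3}) == nil) // current owner: accepted
	fmt.Println(w.ValidateOwner(OwnerMeta{"wk-1", "cp-b", 2}) == nil) // stale epoch: rejected
	fmt.Println(w.ValidateOwner(OwnerMeta{"wk-1", "cp-b", 4}) == nil) // takeover, newer epoch: accepted
}
```

The key property is that a replica holding a stale `owner_epoch` can never affect the worker again, no matter how out of date its local state is.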

Important consequence:

  • Idle warm workers are shared pool resources.
  • Leased workers are temporarily owned.
  • Tenant-touched workers are never reused across tenants.

Common Scenarios

  1. Shared warm pool steady state
  • The janitor keeps a configured target of neutral idle workers available.
  • These workers have no meaningful live tenant binding.
  • They exist only to reduce cold-start latency.
  2. New pgwire session
  • A client connects to any control-plane replica.
  • The replica tries to claim an idle worker from the shared pool using a Postgres transaction.
  • If it wins a claim, it sets ownership metadata and reserves that worker for the target org.
  • The worker is activated for that tenant runtime.
  • The control plane creates a worker-local session and proxies the client.
  • When the last session ends, that worker is retired.
  • Replacement warm capacity is replenished separately by spawning another neutral worker.
  3. New Flight session
  • Same worker-claim path as pgwire.
  • The control plane also writes a durable Flight reconnect record.
  • That record binds the session token to:
    • org
    • worker
    • owner metadata
    • TTL
  • If the serving control plane stays healthy, the session behaves normally.
  4. Two control planes race for the same idle worker
  • Both try to claim an idle row in Postgres.
  • FOR UPDATE SKIP LOCKED / transactional update lets only one claim succeed.
  • The winner bumps owner_epoch and becomes the lease holder.
  • Even if the loser had stale local knowledge, worker-side fencing rejects stale RPCs later.
  5. Control-plane crash during pgwire
  • The client connection dies.
  • Existing pgwire session is lost.
  • The worker may still be alive for a short time, but that does not preserve the pgwire session.
  • Janitor or takeover logic retires the worker (or, in the Flight case, reconciles it for takeover).
  • Client reconnects through another control-plane replica.
  6. Control-plane crash during Flight
  • The gRPC stream dies.
  • The client reconnects with the Flight token before TTL expiry.
  • The new control-plane replica looks up the durable reconnect record.
  • If the worker still exists and takeover is safe, it can rebind to the same worker by taking ownership with a newer epoch.
  • This is reconnect-and-rebind, not transparent continuation of the old stream.
  • If takeover is not possible, the reconnect fails and the client must start a new session.
  7. Planned rollout
  • A replica enters draining mode.
  • It stops taking new work.
  • Existing sessions are allowed to continue up to the configured drain timeout.
  • Because this is planned, active sessions should usually survive by remaining pinned to their current control plane until completion.
  • After timeout or full drain, the old replica exits.
  • New replicas take new traffic.
  • This is different from crash failover, where connections are lost.
  8. Orphaned worker cleanup
  • If a control-plane replica disappears without cleaning up, some workers may still be running.
  • The janitor notices the owner replica’s heartbeat expired.
  • It marks those workers orphaned/stale and retires them safely.
  • This prevents leaked capacity and split-brain control.
  9. Org worker cap enforcement
  • Org caps are enforced both when spawning new workers and when claiming warm idle workers.
  • That prevents multiple replicas from exceeding the per-org worker limit just by racing on shared warm capacity.

Operational Semantics

  • Horizontal scalability means multiple control planes can accept new traffic.
  • It does not mean existing pgwire connections survive control-plane failure.
  • Isolation is favored over aggressive worker reuse.
  • Shared pool reuse is only for never-activated neutral workers.
  • Tenant-activated workers are single-tenant and disposable after use.

@bill-ph changed the title from "Add scalable multitenant control plane routing and activation" to "Implement scalable remote control-plane coordination" on Mar 27, 2026
@EDsCODE
Contributor

EDsCODE commented Mar 27, 2026

PR #374 Review: Scalable Remote Control-Plane Coordination

+5468 / -419 across 55 files, 33 commits

This is a substantial architectural change that moves from single-process control-plane coordination to horizontally scalable, database-backed runtime coordination. Overall the design is solid — epoch-based fencing, advisory locks for capacity gates, durable Flight reconnect, and leader-elected janitor cleanup are all well thought out. Here are my findings:


High Severity

1. TakeOverWorker returns nil on epoch mismatch — indistinguishable from "no worker found"
controlplane/configstore/store.go — when the epoch check fails, the function returns nil, nil (no error). Callers like claimSpecificWorker in k8s_pool.go can't distinguish "stale epoch" from "worker doesn't exist", which could cause infinite retry loops on a permanently-stale epoch. Consider returning a sentinel error (e.g., ErrEpochMismatch) so callers can break out.

2. Concurrent Flight reconnect race can orphan worker sessions
server/flightsqlingress/ingress.go:~1748 — if a client sends two reconnect requests simultaneously, both can pass the durable store lookup, both call ReconnectSession, both get new pids, and the second overwrites the first in s.sessions[token]. The first pid's worker session becomes orphaned. A sync.Mutex or compare-and-swap on the token during reconnect registration would fix this.
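One possible shape for the suggested guard: a mutex-protected per-token registration so only one reconnect for a token is in flight at a time. The `reconnectGuard` type and its methods are hypothetical, not the ingress code:

```go
package main

import (
	"fmt"
	"sync"
)

// reconnectGuard ensures only one in-flight reconnect per token; the
// losing request bails out instead of overwriting the winner's pid.
type reconnectGuard struct {
	mu       sync.Mutex
	inFlight map[string]bool
}

// begin atomically checks and marks the token; it returns false when
// another reconnect already owns it.
func (g *reconnectGuard) begin(token string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.inFlight[token] {
		return false
	}
	g.inFlight[token] = true
	return true
}

// end releases the token once the reconnect has fully registered
// (or failed), allowing a later legitimate reconnect.
func (g *reconnectGuard) end(token string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.inFlight, token)
}

func main() {
	g := &reconnectGuard{inFlight: map[string]bool{}}
	fmt.Println(g.begin("tok-1")) // first reconnect proceeds
	fmt.Println(g.begin("tok-1")) // concurrent duplicate is rejected
	g.end("tok-1")
	fmt.Println(g.begin("tok-1")) // after completion, reconnect is allowed again
}
```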

3. Janitor early-return on first error skips remaining cleanup
controlplane/janitor.go:~80 — runOnce() returns on the first error from ExpireControlPlaneInstances. If that fails (e.g., transient DB error), orphaned worker cleanup, stuck worker cleanup, warm capacity reconciliation, and Flight session expiry all get skipped for that cycle. Consider logging and continuing rather than returning early, or at minimum running all cleanup steps independently.


Medium Severity

4. No liveness recheck on the takeover path
controlplane/k8s_pool.go — reserveClaimedWorker does a liveness recheck for warm-pool claims (~line 1035) but claimSpecificWorker for takeover (~line 1199) skips it. A dead pod can be returned to the reservation flow.

5. Reconnect failure doesn't update durable record state
server/flightsqlingress/ingress.go:~1735 — when ReconnectSession fails, the error is logged but the durable record stays active. The client will retry immediately with no backoff, hammering the reconnect path. Consider marking the record as expired/failed, or adding a retry-after hint.

6. Lease expiry not checked during query execution
controlplane/org_reserved_pool.go:~269 — worker lease is set on claim (default 24h) but never refreshed during long-running queries. If a query exceeds the lease, the worker could theoretically be reassigned mid-execution. Low probability with 24h default, but worth a comment or a lease-refresh mechanism for safety.

7. Find vs Get inconsistency in store
controlplane/configstore/store.go — GetFlightSessionRecord() returns an error on not-found, but FindFlightSessionRecord() returns (nil, nil). This inconsistency will confuse callers. Pick a convention (Go standard: Get errors, Find returns nil) and document it.


Low Severity

8. Drain blocks indefinitely on long-running queries
server/flightsqlingress/ingress.go:~1664 — drain waits for all sessions to become idle. A session with a long-running query blocks drain until HandoverDrainTimeout (15m), at which point sessions are force-closed without updating durable records. Orphaned queries on the worker continue executing. Probably acceptable operationally but worth documenting.

9. No draining check on reconnect path
server/flightsqlingress/ingress.go — during drain, new pgwire connections are rejected but Flight reconnects can still succeed. The reconnected session will be unable to do meaningful work since the CP is shutting down. Rejecting reconnects during drain would give clients a cleaner signal to connect to a new replica.

10. Timing-sensitive tests
controlplane/runtime_tracker_test.go and controlplane/janitor_test.go use time.Sleep with narrow windows (10-25ms intervals, 250ms deadlines). These will be flaky under CI load. Consider channel-based synchronization or larger margins.

11. Resource limits may be tight
k8s/control-plane-deployment.yaml — requests: 100m/128Mi, limits: 500m/256Mi. May be fine for a thin proxy but could be tight with leader election, janitor loops, and Flight session management under load. Worth monitoring.


Positives

  • Epoch fencing is thorough: Three-layer defense (database CAS, in-memory state, per-RPC validation) consistently applied across all worker control RPCs
  • SQL injection safe: All queries use parameterized inputs; schema names go through quoteIdentifier()
  • Advisory locks for capacity gating: Clean pattern for preventing over-provisioning across concurrent replicas
  • Reconnect uses OrgID from durable record, not current user: Prevents session hijacking across org boundaries
  • CI improvements are solid: ECR mirror, SHA-pinned kind images, retry logic all improve reliability
  • Manifest test (k8s_manifest_test.go) catching readiness probe regressions at build time is a nice touch
  • Runbooks are clear, accurate, and operationally useful
  • Label-based ownership is the right move over OwnerReferences for graceful drain support

Bottom Line

The core design is sound and the fencing mechanism is well-implemented where it matters most (the RPC boundary). The high-severity items (#1, #2, #3) are worth fixing before merge — they're all straightforward. The rest are improvements that could be addressed in follow-ups.

@bill-ph
Collaborator Author

bill-ph commented Mar 27, 2026


Responses to the review findings above:

  1. It just fails; there is no retry. Agreed that the error message could be more accurate; fixed.
  2. Epoch fencing will reject one of the requests.
  3. Fixed.
  4. Fixed.
  5. Fixed.
  6. The current implementation retires workers when their session terminates, so lease expiry is redundant; removed.
  7. Cleaned up.
  8. Intended behavior: we want seamless deployment without terminating in-flight queries.
  9. Same as 8.
  10. Switched sleeps to more robust synchronization.
  11. The YAML is obsolete; removed.

@bill-ph bill-ph merged commit df1a3ce into main Mar 27, 2026
20 checks passed
@bill-ph bill-ph deleted the scalable-cp branch March 27, 2026 19:38