Implement scalable remote control-plane coordination#374
Conversation
PR #374 Review: Scalable Remote Control-Plane Coordination+5468 / -419 across 55 files, 33 commits This is a substantial architectural change that moves from single-process control-plane coordination to horizontally scalable, database-backed runtime coordination. Overall the design is solid — epoch-based fencing, advisory locks for capacity gates, durable Flight reconnect, and leader-elected janitor cleanup are all well thought out. Here are my findings: High Severity1. 2. Concurrent Flight reconnect race can orphan worker sessions 3. Janitor early-return on first error skips remaining cleanup Medium Severity4. No liveness recheck on the takeover path 5. Reconnect failure doesn't update durable record state 6. Lease expiry not checked during query execution 7. Low Severity8. Drain blocks indefinitely on long-running queries 9. No draining check on reconnect path 10. Timing-sensitive tests 11. Resource limits may be tight Positives
Bottom LineThe core design is sound and the fencing mechanism is well-implemented where it matters most (the RPC boundary). The high-severity items (#1, #2, #3) are worth fixing before merge — they're all straightforward. The rest are improvements that could be addressed in follow-ups. |
**1. it just fails, there is no retry, but agree that the error message could be more accurate - fixed. |
Summary
Behavior
Testing
just test-unitjust test-controlplane-k8sjust test-configstore-integrationjust lintMore Details
Design
The current design is a horizontally scalable control plane with shared coordination in Postgres, but with a strict isolation boundary for tenant-touched workers.
Core model:
idlecapacity.owner_cp_instance_idowner_epochowner_epochis the fencing token. It prevents stale replicas from continuing to control a worker after ownership changes.owner_epochcp_instance_idworker_idImportant consequence:
Common Scenarios
idlerow in Postgres.FOR UPDATE SKIP LOCKED/ transactional update lets only one claim succeed.owner_epochand becomes the lease holder.Operational Semantics