feat: Add option to disable cluster websocket connection and fix instability#735
feat: Add option to disable cluster websocket connection and fix instability#735maciaszczykm wants to merge 20 commits intomainfrom
Conversation
* add option to disable cluster websocket connection * add option to disable websocket connection in console manager * remove websocket configuration check and simplify console manager setup * simplify console manager initialization by removing unnecessary parameters * streamline socket connection handling in controller manager * add Close method to Socket interface and implement in socket * add option to disable websocket connection in controller manager * disable websocket in agent configuration * make socket methods thread-safe by adding mutex locks * refactor startController method to improve readability * add reconnect logic to websocket with `closed` state handling and improve thread safety * improve comments in socket.go for clarity and consistency * refactor websocket connection handling to improve clarity and prevent blocking * refactor wssUri function to construct websocket URL more clearly and efficiently * remove redundant comments * enhance websocket connection handling with improved error messages and initialization * refactor websocket event handling to improve clarity, error messaging, and validation * introduce `joining` state to websocket for improved thread safety and prevent deadlock, and enable websocket in agent configuration * refactor websocket reconnect logic for clarity, removing redundant `reconnect` method * fix socket message handling to ensure proper mutex unlocking when socket is closed * refactor * adjust log levels * ensure proper state management on connection failure * ensure proper state management on connection failure * update websocket disconnection logic * enhance websocket disconnection handling with reconnect logic * update websocket disconnection message and handle client closure * add option to create new websocket client with generation tracking * add clientReceiver to manage websocket client generation and prevent stale callbacks * refactor notifyConnectLocked to return a boolean for Join() invocation * Update pkg/websocket/client_receiver.go Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…t gophoenix own transport reconnects to prevent protocol error churn
Greptile SummaryThis PR replaces the Phoenix-based WebSocket (
Confidence Score: 5/5Safe to merge; the concurrent state machine is well-reasoned and the test suite covers the key race scenarios. The socket rewrite carefully handles all identified concurrent access patterns: the publishClient double-check correctly guards the Close/Join race window, stale generations prevent old goroutines from interfering with new connections, and the poll-driven reconnect loop gives clean separation of concerns. The agentruntime controller changes are strictly defensive improvements. No data-loss or incorrect-behavior paths were found. No files require special attention; the concurrent logic in pkg/websocket/socket.go is the most complex part of the change but appears correct.
|
| Filename | Overview |
|---|---|
| pkg/websocket/socket.go | Complete rewrite replacing gophoenix Phoenix WebSocket with hasura go-graphql-client GraphQL subscription; atomic state machine for joining/closed/clientHandle is carefully implemented with correct race-condition handling |
| pkg/controller/controller_manager.go | Poll loop now checks IsWebsocketDisabled() each tick, calling Close() or Join() accordingly; idempotent Close() prevents log spam on repeated ticks while disabled |
| pkg/controller/controller_manager_options.go | WithSocketArgs now creates a closed socket via NewClosed(), deferring connection to the poll loop so AgentConfiguration reconciliation can apply disableWebsocket before any connection is attempted |
| pkg/websocket/socket_test.go | New test file covering closed-start behaviour, URL validation, notification dispatch, stale-generation guard, nil-publisher rejection, concurrent Join/Close safety, and GraphQL field-name correctness |
| internal/controller/agentruntime_controller.go | Improved error handling: nil runtime guard, non-not-found Get errors now propagated, AlreadyExists on Create handled; publisher key updated from agent.run.event to agent_run |
| api/v1alpha1/agentconfig_types.go | Adds optional *bool DisableWebsocket field to AgentConfigurationSpec; pointer semantics correctly distinguish nil/unset from explicit false |
| pkg/common/config.go | Adds disableWebsocket field and IsWebsocketDisabled() reader protected by the existing RWMutex; nil means not set which equals websocket enabled |
| pkg/controller/test/common.go | FakeSocket updated to match slimmed Socket interface: removed NotifyConnect/Disconnect and OnXxx methods, added Close() |
| cmd/agent/kubernetes.go | Publisher key for AgentRuntime corrected from agent.run.event to agent_run to match the GraphQL notification resource name |
Reviews (9): Last reviewed commit: "update other keys" | Re-trigger Greptile
There was a problem hiding this comment.
Pull request overview
Adds a runtime configuration flag to disable the cluster websocket connection and updates the controller manager + websocket implementation to behave safely under reconnect/close scenarios (addressing prior instability that led to reverts).
Changes:
- Introduces
disableWebsocketinAgentConfigurationSpec, propagates it into the runtimeConfigurationManager, and documents it (CRDs + docs + samples). - Updates controller manager startup/poll loop to close the websocket and skip joining when
disableWebsocketis enabled, and to start with a “closed” socket so runtime config can take effect before connecting. - Refactors websocket socket implementation for safer reconnect/close behavior (generation-gated callbacks, locking) and adds unit tests.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/websocket/socket.go | Adds closed-start socket option, reconnect/close semantics, and thread-safety around callbacks/joining |
| pkg/websocket/client_receiver.go | Introduces generation-scoped receiver to ignore stale client callbacks |
| pkg/websocket/socket_test.go | Adds tests covering closed start, stale callback ignoring, and callback storms |
| pkg/controller/controller_manager_options.go | Switches manager socket init to NewClosed so runtime config can decide whether to connect |
| pkg/controller/controller_manager.go | Skips websocket join + closes socket when disableWebsocket is enabled |
| pkg/common/config.go | Stores and exposes disableWebsocket via IsWebsocketDisabled() |
| api/v1alpha1/agentconfig_types.go | Adds DisableWebsocket field to AgentConfigurationSpec |
| api/v1alpha1/zz_generated.deepcopy.go | Deepcopy support for DisableWebsocket |
| config/crd/bases/deployments.plural.sh_agentconfigurations.yaml | CRD schema updated with disableWebsocket |
| charts/deployment-operator/crds/deployments.plural.sh_agentconfigurations.yaml | Helm CRD schema updated with disableWebsocket |
| config/samples/agentConfiguration.yaml | Sample updated with disableWebsocket |
| docs/api.md | API docs updated to include disableWebsocket |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
33f1b7f to
029cc81
Compare
029cc81 to
c2c02c7
Compare
c2c02c7 to
032650c
Compare
This change adds option to disable cluster websocket connection that was previously added in #684 and then reverted in #703 due to websocket instability observed in logs. GraphQL subscription is used now.
It will also fix PROD-4693.
Test Plan
Test environment: https://console.plrl-dev-aws.onplural.sh/
Git ref: marcin/prod-4573-deployment-operator-websocket-instability
Tag: sha-512bd48
Checklist