Skip to content

feat: Add option to disable cluster websocket connection and fix instability#735

Open
maciaszczykm wants to merge 20 commits intomainfrom
marcin/prod-4573-deployment-operator-websocket-instability
Open

feat: Add option to disable cluster websocket connection and fix instability#735
maciaszczykm wants to merge 20 commits intomainfrom
marcin/prod-4573-deployment-operator-websocket-instability

Conversation

@maciaszczykm
Copy link
Copy Markdown
Member

@maciaszczykm maciaszczykm commented Apr 16, 2026

This change adds option to disable cluster websocket connection that was previously added in #684 and then reverted in #703 due to websocket instability observed in logs. GraphQL subscription is used now.

It will also fix PROD-4693.

Test Plan

Test environment: https://console.plrl-dev-aws.onplural.sh/
Git ref: marcin/prod-4573-deployment-operator-websocket-instability
Tag: sha-512bd48

Zrzut ekranu 2026-05-5 o 11 17 43

Checklist

  • I have added a meaningful title and summary to convey the impact of this PR to a user.
  • I have deployed the agent to a test environment and verified that it works as expected.
    • Agent starts successfully.
    • Service templating works correctly.
    • Service errors are reported properly and visible in the UI.
    • Service updates are reflected properly in the cluster.
    • Service resync triggers immediately and works as expected.
    • Sync waves annotations are respected.
    • Service component trees are working as expected.
    • Cluster health statuses are being updated.
    • Agent logs do not contain any errors (after running for at least 30 minutes).
    • There are no visible anomalies in Datadog (after running for at least 30 minutes).
  • I have added tests to cover my changes.
  • If required, I have updated the Plural documentation accordingly.

* add option to disable cluster websocket connection

* add option to disable websocket connection in console manager

* remove websocket configuration check and simplify console manager setup

* simplify console manager initialization by removing unnecessary parameters

* streamline socket connection handling in controller manager

* add Close method to Socket interface and implement in socket

* add option to disable websocket connection in controller manager

* disable websocket in agent configuration

* make socket methods thread-safe by adding mutex locks

* refactor startController method to improve readability

* add reconnect logic to websocket with `closed` state handling and improve thread safety

* improve comments in socket.go for clarity and consistency

* refactor websocket connection handling to improve clarity and prevent blocking

* refactor wssUri function to construct websocket URL more clearly and efficiently

* remove redundant comments

* enhance websocket connection handling with improved error messages and initialization

* refactor websocket event handling to improve clarity, error messaging, and validation

* introduce `joining` state to websocket for improved thread safety and prevent deadlock, and enable websocket in agent configuration

* refactor websocket reconnect logic for clarity, removing redundant `reconnect` method

* fix socket message handling to ensure proper mutex unlocking when socket is closed

* refactor

* adjust log levels

* ensure proper state management on connection failure

* ensure proper state management on connection failure

* update websocket disconnection logic

* enhance websocket disconnection handling with reconnect logic

* update websocket disconnection message and handle client closure

* add option to create new websocket client with generation tracking

* add clientReceiver to manage websocket client generation and prevent stale callbacks

* refactor notifyConnectLocked to return a boolean for Join() invocation

* Update pkg/websocket/client_receiver.go

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@linear
Copy link
Copy Markdown

linear Bot commented Apr 16, 2026

@maciaszczykm maciaszczykm changed the title feat: Add option to disable cluster websocket connection (#684) feat: Add option to disable cluster websocket connection Apr 16, 2026
…t gophoenix own transport reconnects to prevent protocol error churn
@maciaszczykm
Copy link
Copy Markdown
Member Author

@greptileai

@github-actions github-actions Bot added size/XL and removed size/L labels Apr 16, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 16, 2026

Greptile Summary

This PR replaces the Phoenix-based WebSocket (gophoenix) with a hasura/go-graphql-client GraphQL subscription, adds a disableWebsocket toggle to AgentConfigurationSpec, and fixes several instability issues in the previous implementation.

  • WebSocket rewrite: The new socket struct uses atomic operations (joining, closed, clientHandle, clientGen) to manage the subscription-client lifecycle safely under concurrent Join()/Close() calls. The outer poll loop in controller_manager.go drives reconnect rather than relying on internal library retries. All publisher keys are renamed from \"service.event\"-style strings to bare resource names (\"service\", \"stack_run\", etc.) to match the GraphQL notification resource field.
  • disableWebsocket feature: WithSocketArgs now creates a closed socket via NewClosed() so the poll loop—not the constructor—decides when to open the connection, giving AgentConfiguration reconciliation time to apply the flag before any network activity. When disabled, each poll tick calls Close() (which is idempotent); when re-enabled, the next tick calls Join() and reconnects.
  • agentruntime_controller.go hardening: Added nil guard on run.Runtime, propagates non-not-found errors from the pre-create Get, and handles AlreadyExists on Create to survive concurrent writes.

Confidence Score: 5/5

Safe to merge; the concurrent state machine is well-reasoned and the test suite covers the key race scenarios.

The socket rewrite carefully handles all identified concurrent access patterns: the publishClient double-check correctly guards the Close/Join race window, stale generations prevent old goroutines from interfering with new connections, and the poll-driven reconnect loop gives clean separation of concerns. The agentruntime controller changes are strictly defensive improvements. No data-loss or incorrect-behavior paths were found.

No files require special attention; the concurrent logic in pkg/websocket/socket.go is the most complex part of the change but appears correct.

Important Files Changed

Filename Overview
pkg/websocket/socket.go Complete rewrite replacing gophoenix Phoenix WebSocket with hasura go-graphql-client GraphQL subscription; atomic state machine for joining/closed/clientHandle is carefully implemented with correct race-condition handling
pkg/controller/controller_manager.go Poll loop now checks IsWebsocketDisabled() each tick, calling Close() or Join() accordingly; idempotent Close() prevents log spam on repeated ticks while disabled
pkg/controller/controller_manager_options.go WithSocketArgs now creates a closed socket via NewClosed(), deferring connection to the poll loop so AgentConfiguration reconciliation can apply disableWebsocket before any connection is attempted
pkg/websocket/socket_test.go New test file covering closed-start behaviour, URL validation, notification dispatch, stale-generation guard, nil-publisher rejection, concurrent Join/Close safety, and GraphQL field-name correctness
internal/controller/agentruntime_controller.go Improved error handling: nil runtime guard, non-not-found Get errors now propagated, AlreadyExists on Create handled; publisher key updated from agent.run.event to agent_run
api/v1alpha1/agentconfig_types.go Adds optional *bool DisableWebsocket field to AgentConfigurationSpec; pointer semantics correctly distinguish nil/unset from explicit false
pkg/common/config.go Adds disableWebsocket field and IsWebsocketDisabled() reader protected by the existing RWMutex; nil means not set which equals websocket enabled
pkg/controller/test/common.go FakeSocket updated to match slimmed Socket interface: removed NotifyConnect/Disconnect and OnXxx methods, added Close()
cmd/agent/kubernetes.go Publisher key for AgentRuntime corrected from agent.run.event to agent_run to match the GraphQL notification resource name

Reviews (9): Last reviewed commit: "update other keys" | Re-trigger Greptile

Comment thread pkg/websocket/socket.go Outdated
Comment thread pkg/controller/controller_manager.go
@maciaszczykm maciaszczykm changed the title feat: Add option to disable cluster websocket connection feat: Add option to disable cluster websocket connection and fix instability Apr 16, 2026
@maciaszczykm
Copy link
Copy Markdown
Member Author

@greptileai

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a runtime configuration flag to disable the cluster websocket connection and updates the controller manager + websocket implementation to behave safely under reconnect/close scenarios (addressing prior instability that led to reverts).

Changes:

  • Introduces disableWebsocket in AgentConfigurationSpec, propagates it into the runtime ConfigurationManager, and documents it (CRDs + docs + samples).
  • Updates controller manager startup/poll loop to close the websocket and skip joining when disableWebsocket is enabled, and to start with a “closed” socket so runtime config can take effect before connecting.
  • Refactors websocket socket implementation for safer reconnect/close behavior (generation-gated callbacks, locking) and adds unit tests.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/websocket/socket.go Adds closed-start socket option, reconnect/close semantics, and thread-safety around callbacks/joining
pkg/websocket/client_receiver.go Introduces generation-scoped receiver to ignore stale client callbacks
pkg/websocket/socket_test.go Adds tests covering closed start, stale callback ignoring, and callback storms
pkg/controller/controller_manager_options.go Switches manager socket init to NewClosed so runtime config can decide whether to connect
pkg/controller/controller_manager.go Skips websocket join + closes socket when disableWebsocket is enabled
pkg/common/config.go Stores and exposes disableWebsocket via IsWebsocketDisabled()
api/v1alpha1/agentconfig_types.go Adds DisableWebsocket field to AgentConfigurationSpec
api/v1alpha1/zz_generated.deepcopy.go Deepcopy support for DisableWebsocket
config/crd/bases/deployments.plural.sh_agentconfigurations.yaml CRD schema updated with disableWebsocket
charts/deployment-operator/crds/deployments.plural.sh_agentconfigurations.yaml Helm CRD schema updated with disableWebsocket
config/samples/agentConfiguration.yaml Sample updated with disableWebsocket
docs/api.md API docs updated to include disableWebsocket

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/websocket/socket.go Outdated
Comment thread pkg/websocket/socket.go Outdated
Comment thread pkg/websocket/socket.go Outdated
Comment thread pkg/websocket/socket.go Outdated
@maciaszczykm
Copy link
Copy Markdown
Member Author

@greptileai

@maciaszczykm
Copy link
Copy Markdown
Member Author

@greptileai

@socket-security
Copy link
Copy Markdown

socket-security Bot commented Apr 17, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Addedgithub.com/​hasura/​go-graphql-client@​v0.16.097100100100100

View full report

@maciaszczykm maciaszczykm force-pushed the marcin/prod-4573-deployment-operator-websocket-instability branch 5 times, most recently from 33f1b7f to 029cc81 Compare April 17, 2026 12:23
@maciaszczykm maciaszczykm force-pushed the marcin/prod-4573-deployment-operator-websocket-instability branch from 029cc81 to c2c02c7 Compare April 17, 2026 12:26
@maciaszczykm maciaszczykm force-pushed the marcin/prod-4573-deployment-operator-websocket-instability branch from c2c02c7 to 032650c Compare April 17, 2026 12:42
@maciaszczykm
Copy link
Copy Markdown
Member Author

@greptileai

@maciaszczykm
Copy link
Copy Markdown
Member Author

@greptileai

@maciaszczykm maciaszczykm marked this pull request as ready for review May 4, 2026 12:38
@maciaszczykm maciaszczykm requested a review from a team as a code owner May 4, 2026 12:38
Comment thread pkg/websocket/socket.go
@maciaszczykm
Copy link
Copy Markdown
Member Author

@greptileai

@maciaszczykm maciaszczykm marked this pull request as draft May 4, 2026 13:28
@maciaszczykm maciaszczykm marked this pull request as ready for review May 5, 2026 09:05
@maciaszczykm
Copy link
Copy Markdown
Member Author

@greptileai

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants