fix(OBS-2257): token refresh failure causes agent to become stale #287
jajeffries merged 9 commits into develop from
Conversation
…ect worker

When token refresh fails, the reconnect goroutine previously logged the error and immediately waited for the next monitor tick (up to 30s), leaving the MQTT connection alive with an expired token. The broker silently rejects heartbeats, causing the server to mark the agent Stale while the agent process continues running.

Three fixes applied in a new runReconnectWorker method:

1. Exponential backoff retry: up to 5 attempts (5s→10s→20s→40s→2min cap) before giving up, respecting ctx cancellation between sleeps.
2. Explicit Disconnect after exhaustion: stops the heartbeat goroutine so the server sees the agent as Offline rather than Stale.
3. Fast re-signal via time.AfterFunc: schedules a signal on reconnectChan 30s after exhaustion so recovery does not wait for the next full monitor tick.

Timing parameters are injected via the method signature so tests can use millisecond-scale values without slowing the test suite.

Made-with: Cursor
… re-signal

Adds a controlledTokenServer test helper that can be configured to fail for N requests before returning a valid token, enabling precise control over the number of retry attempts the worker makes.

Three new test cases cover the three fix scenarios:

- RetriesOnTransientFailure: 2 transient failures then success; asserts Disconnect is NOT called and no re-signal is scheduled.
- DisconnectsAfterAllRetriesFail: permanent failure; asserts Disconnect is called after all retries are exhausted.
- ReschedulesAfterExhaustion: permanent failure; asserts reconnectChan receives a re-signal after the retry delay window.

Also adds DisconnectCalled tracking to MockMQTTConnection to support the new assertions.

Made-with: Cursor
Go test coverage
Total coverage: 60.1%
Add sync.RWMutex to AuthTokenManager to guard concurrent reads and writes between the reconnect worker goroutine (writer via GetToken) and the token expiry monitor goroutine (readers via IsTokenExpired, IsTokenExpiringSoon, GetTokenExpiryTime). The HTTP call in GetToken is intentionally kept outside the lock to avoid blocking readers during I/O; all struct field writes are batched into a single critical section at the end. RefreshToken snapshots credentials under a read lock before calling GetToken.

Add TestAuthTokenManager_ConcurrentAccess to exercise the race detector: 4 writer and 4 reader goroutines run concurrently for 100ms; the test fails under go test -race without the mutex.

Made-with: Cursor
Pass monitorCtx to runReconnectWorker so Stop()'s monitorCancel() also terminates the reconnect goroutine. Replace the for-range loop (which only exits when the channel is closed) with a for/select on ctx.Done() and reconnectChan so the worker exits promptly on shutdown. Made-with: Cursor
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b842c3f184
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The timer created by time.AfterFunc in runReconnectWorker was discarded, so its callback could fire after Stop() cancelled monitorCtx — sending on reconnectChan and logging after the manager was logically shut down. Guard the callback with a ctx.Err() check: if the context has been cancelled, the callback returns immediately without side effects. Made-with: Cursor
The reset goroutine in Start() captured local variables (connectionDetails, backends, cfg.OrbAgent.Labels, configYaml, topics) by closure at startup. After a successful JWT refresh, refreshAndReconnect updated fleetManager.connectionDetails and related fields, but the reset goroutine continued reconnecting with the stale initial values. Switch all closure references to fleetManager.* struct fields so the goroutine always uses the most recently refreshed token and topics. Also capture ConnectionDetails in MockMQTTConnection.LastConnectDetails to enable assertion of which details were passed to Connect in tests. Made-with: Cursor
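The bug class this commit fixes (closure capturing a startup-time copy instead of reading the struct field) can be reduced to a few lines. This is an illustration with simplified hypothetical names; the real code also guards the field with a mutex, omitted here for brevity.

```go
package main

import "fmt"

// manager stands in for FleetConfigManager with a single refreshable field.
type manager struct {
	connectionDetails string
}

// reconnectValues contrasts the two closure styles: one captures a local
// copy at startup (the bug), the other reads the struct field (the fix).
func reconnectValues() (stale, fresh string) {
	m := &manager{connectionDetails: "token-v1"}

	details := m.connectionDetails // local captured at startup

	staleReconnect := func() string { return details }             // bug: frozen at startup
	freshReconnect := func() string { return m.connectionDetails } // fix: reads current field

	m.connectionDetails = "token-v2" // a later refresh updates the field

	return staleReconnect(), freshReconnect()
}

func main() {
	stale, fresh := reconnectValues()
	fmt.Println("stale closure reconnects with:", stale) // → token-v1
	fmt.Println("field access reconnects with:", fresh)  // → token-v2
}
```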
…ionDetails

- TestFleetConfigManager_ReconnectWorker_AfterFuncSuppressedOnContextCancel: exhausts retries, cancels the context before retryDelay elapses, and asserts that no re-signal is sent on reconnectChan after shutdown.
- TestFleetConfigManager_ResetGoroutine_UsesLatestConnectionDetails: primes the manager with an initial token, simulates a token refresh by updating fleetManager.connectionDetails, triggers a reset, and asserts that Connect was called with the refreshed token rather than the stale one.

Made-with: Cursor
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fc4d019324
Two P1 issues flagged by the automated Codex review:

1. Bound Disconnect with a timeout in runReconnectWorker

The Disconnect call previously used the long-lived monitorCtx. If the MQTT broker hung during teardown, the call could block indefinitely, preventing the time.AfterFunc re-signal from ever being scheduled. Fixed by creating a context.WithTimeout(context.Background(), timeout) for the Disconnect call, consistent with the existing reset goroutine.

2. Synchronise connectionDetails across goroutines

refreshAndReconnect (reconnect worker goroutine) wrote fleetManager.connectionDetails while the reset goroutine could read it concurrently: a data race. Fixed by adding connMu sync.RWMutex to FleetConfigManager: writes in refreshAndReconnect are guarded by connMu.Lock(); reads in the reset goroutine and runReconnectWorker snapshot the value under connMu.RLock() before use.

Also promotes ConnectCalled, DisconnectCalled, and LastConnectDetails on MockMQTTConnection to mutex-guarded accessor methods, eliminating a secondary race flagged by go test -race between the goroutine under test and the require.Eventually poller in tests.

Made-with: Cursor
This pull request introduces concurrency safety improvements to the FleetConfigManager and AuthTokenManager components, ensuring thread-safe access to shared state during agent operation and token refresh workflows. It also updates the mock implementations and associated tests to support concurrent access, preventing data races and improving reliability under multi-goroutine scenarios.

Concurrency safety enhancements:

- Added sync.RWMutex fields (connMu in FleetConfigManager, mu in AuthTokenManager) to guard shared state, ensuring consistent reads and writes across goroutines. All access to critical fields such as connectionDetails and token-related fields is now protected by appropriate locks. [1] [2]
- Updated AuthTokenManager methods (GetToken, RefreshToken, IsTokenExpired, IsTokenExpiringSoon, GetTokenExpiryTime) to use read and write locks, preventing race conditions during concurrent token refresh and expiry checks.

Agent connection and token refresh workflow improvements:

- Updated FleetConfigManager.Start to snapshot connection details under a read lock, and to use the latest details during reconnects, ensuring atomicity and consistency during MQTT connection resets.
- Added a runReconnectWorker goroutine with exponential backoff and retry logic for token refresh and reconnection, including proper teardown and fast retry signaling after failed attempts. This worker is coordinated via a shared cancellable context for clean shutdown.

Testing and mock improvements:

- Updated MockMQTTConnection with mutex-protected state and accessor methods (ConnectCalled, DisconnectCalled, LastConnectDetails) to support concurrent test scenarios.
- Added a concurrency test (TestAuthTokenManager_ConcurrentAccess) to verify thread safety of AuthTokenManager under concurrent reads and writes, leveraging the Go race detector.

These changes collectively eliminate potential data races, improve reliability in concurrent environments, and provide robust test coverage for concurrency scenarios.

References: