fix(OBS-2257): token refresh failure causes agent to become stale #287

Merged: jajeffries merged 9 commits into develop from fix/OBS-2257-stale-jwt on Mar 12, 2026

Conversation


@jajeffries jajeffries commented Mar 11, 2026

This pull request introduces concurrency safety improvements to the FleetConfigManager and AuthTokenManager components, ensuring thread-safe access to shared state during agent operation and token refresh workflows. It also updates the mock implementations and associated tests to support concurrent access, preventing data races and improving reliability under multi-goroutine scenarios.

Concurrency safety enhancements:

  • Added sync.RWMutex fields (connMu in FleetConfigManager, mu in AuthTokenManager) to guard shared state, ensuring consistent reads and writes across goroutines. All access to critical fields such as connectionDetails and token-related fields is now protected by appropriate locks. [1] [2]
  • Refactored AuthTokenManager methods (GetToken, RefreshToken, IsTokenExpired, IsTokenExpiringSoon, GetTokenExpiryTime) to use read and write locks, preventing race conditions during concurrent token refresh and expiry checks.

Agent connection and token refresh workflow improvements:

  • Updated the agent reset and reconnect logic in FleetConfigManager.Start to snapshot connection details under a read lock, and to use the latest details during reconnects, ensuring atomicity and consistency during MQTT connection resets.
  • Introduced a robust runReconnectWorker goroutine with exponential backoff and retry logic for token refresh and reconnection, including proper teardown and fast retry signaling after failed attempts. This worker is coordinated via a shared cancellable context for clean shutdown.
  • Ensured that updated connection details are stored under a write lock after token refresh, so all goroutines observe a fully initialized value.

Testing and mock improvements:

  • Enhanced MockMQTTConnection with mutex-protected state and accessor methods (ConnectCalled, DisconnectCalled, LastConnectDetails) to support concurrent test scenarios.
  • Added a new test (TestAuthTokenManager_ConcurrentAccess) to verify thread safety of AuthTokenManager under concurrent reads and writes, leveraging the Go race detector.
  • Updated tests to use the new concurrency-safe mock methods.

These changes collectively eliminate potential data races, improve reliability in concurrent environments, and provide robust test coverage for concurrency scenarios.


References:

…ect worker

When token refresh fails the reconnect goroutine previously logged the
error and immediately waited for the next monitor tick (up to 30s),
leaving the MQTT connection alive with an expired token. The broker
silently rejects heartbeats, causing the server to mark the agent Stale
while the agent process continues running.

Three fixes applied in a new runReconnectWorker method:

1. Exponential backoff retry: up to 5 attempts (5s→10s→20s→40s→2min
   cap) before giving up, respecting ctx cancellation between sleeps.
2. Explicit Disconnect after exhaustion: stops the heartbeat goroutine
   so the server sees the agent as Offline rather than Stale.
3. Fast re-signal via time.AfterFunc: schedules a signal on
   reconnectChan 30s after exhaustion so recovery does not wait for
   the next full monitor tick.

Timing parameters are injected via the method signature so tests can
use millisecond-scale values without slowing the test suite.

Made-with: Cursor
… re-signal

Adds a controlledTokenServer test helper that can be configured to fail
for N requests before returning a valid token, enabling precise control
over the number of retry attempts the worker makes.

Three new test cases cover the three fix scenarios:
- RetriesOnTransientFailure: 2 transient failures then success; asserts
  Disconnect is NOT called and no re-signal is scheduled.
- DisconnectsAfterAllRetriesFail: permanent failure; asserts Disconnect
  is called after all retries are exhausted.
- ReschedulesAfterExhaustion: permanent failure; asserts reconnectChan
  receives a re-signal after the retry delay window.

Also adds DisconnectCalled tracking to MockMQTTConnection to support
the new assertions.

Made-with: Cursor
@github-actions
Copy link

github-actions bot commented Mar 11, 2026

Go test coverage

STATUS ELAPSED PACKAGE COVER PASS FAIL SKIP
🟢 PASS 1.06s github.com/netboxlabs/orb-agent/agent 44.1% 6 0 0
🟢 PASS 31.88s github.com/netboxlabs/orb-agent/agent/backend 76.4% 48 0 0
🟢 PASS 6.03s github.com/netboxlabs/orb-agent/agent/backend/devicediscovery 66.5% 4 0 0
🟢 PASS 0.71s github.com/netboxlabs/orb-agent/agent/backend/mocks 0.0% 0 0 0
🟢 PASS 6.03s github.com/netboxlabs/orb-agent/agent/backend/networkdiscovery 58.3% 4 0 0
🟢 PASS 4.02s github.com/netboxlabs/orb-agent/agent/backend/opentelemetryinfinity 45.2% 2 0 0
🟢 PASS 4.03s github.com/netboxlabs/orb-agent/agent/backend/pktvisor 66.5% 2 0 0
🟢 PASS 6.04s github.com/netboxlabs/orb-agent/agent/backend/snmpdiscovery 58.3% 4 0 0
🟢 PASS 7.03s github.com/netboxlabs/orb-agent/agent/backend/worker 67.8% 7 0 0
🟢 PASS 1.01s github.com/netboxlabs/orb-agent/agent/config 100.0% 6 0 0
🟢 PASS 3.60s github.com/netboxlabs/orb-agent/agent/configmgr 55.3% 39 0 0
🟢 PASS 2.84s github.com/netboxlabs/orb-agent/agent/configmgr/fleet 63.9% 148 0 0
🟢 PASS 1.02s github.com/netboxlabs/orb-agent/agent/otlpbridge 50.0% 10 0 0
🟢 PASS 1.02s github.com/netboxlabs/orb-agent/agent/policies 98.9% 18 0 0
🟢 PASS 1.02s github.com/netboxlabs/orb-agent/agent/policymgr 71.6% 11 0 0
🟢 PASS 1.01s github.com/netboxlabs/orb-agent/agent/redact 81.6% 84 0 0
🟢 PASS 30.69s github.com/netboxlabs/orb-agent/agent/secretsmgr 47.1% 65 0 0
🟢 PASS 1.02s github.com/netboxlabs/orb-agent/agent/telemetry 81.7% 19 0 0
🟢 PASS 1.01s github.com/netboxlabs/orb-agent/agent/version 100.0% 1 0 0

Total coverage: 60.1%

@jajeffries jajeffries self-assigned this Mar 11, 2026
@jajeffries jajeffries marked this pull request as draft March 11, 2026 20:32
Add sync.RWMutex to AuthTokenManager to guard concurrent reads and writes
between the reconnect worker goroutine (writer via GetToken) and the token
expiry monitor goroutine (readers via IsTokenExpired, IsTokenExpiringSoon,
GetTokenExpiryTime). The HTTP call in GetToken is intentionally kept outside
the lock to avoid blocking readers during I/O; all struct field writes are
batched into a single critical section at the end. RefreshToken snapshots
credentials under a read lock before calling GetToken.

Add TestAuthTokenManager_ConcurrentAccess to exercise the race detector:
4 writer and 4 reader goroutines run concurrently for 100ms; the test
fails under go test -race without the mutex.

Made-with: Cursor
Pass monitorCtx to runReconnectWorker so Stop()'s monitorCancel() also
terminates the reconnect goroutine. Replace the for-range loop (which
only exits when the channel is closed) with a for/select on ctx.Done()
and reconnectChan so the worker exits promptly on shutdown.

Made-with: Cursor
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b842c3f184

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

The timer created by time.AfterFunc in runReconnectWorker was discarded,
so its callback could fire after Stop() cancelled monitorCtx — sending on
reconnectChan and logging after the manager was logically shut down.

Guard the callback with a ctx.Err() check: if the context has been
cancelled, the callback returns immediately without side effects.

Made-with: Cursor
The reset goroutine in Start() captured local variables (connectionDetails,
backends, cfg.OrbAgent.Labels, configYaml, topics) by closure at startup.
After a successful JWT refresh, refreshAndReconnect updated
fleetManager.connectionDetails and related fields, but the reset goroutine
continued reconnecting with the stale initial values.

Switch all closure references to fleetManager.* struct fields so the
goroutine always uses the most recently refreshed token and topics.

Also capture ConnectionDetails in MockMQTTConnection.LastConnectDetails
to enable assertion of which details were passed to Connect in tests.

Made-with: Cursor
…ionDetails

- TestFleetConfigManager_ReconnectWorker_AfterFuncSuppressedOnContextCancel:
  exhausts retries, cancels the context before retryDelay elapses, and
  asserts that no re-signal is sent on reconnectChan after shutdown.

- TestFleetConfigManager_ResetGoroutine_UsesLatestConnectionDetails:
  primes the manager with an initial token, simulates a token refresh by
  updating fleetManager.connectionDetails, triggers a reset, and asserts
  that Connect was called with the refreshed token rather than the stale one.

Made-with: Cursor
@jajeffries jajeffries marked this pull request as ready for review March 11, 2026 20:46
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc4d019324


Two P1 issues flagged by the automated Codex review:

1. Bound Disconnect with a timeout in runReconnectWorker
   The Disconnect call previously used the long-lived monitorCtx. If the
   MQTT broker hung during teardown the call could block indefinitely,
   preventing the time.AfterFunc re-signal from ever being scheduled.
   Fixed by creating a context.WithTimeout(context.Background(), timeout)
   for the Disconnect call, consistent with the existing reset goroutine.

2. Synchronise connectionDetails across goroutines
   refreshAndReconnect (reconnect worker goroutine) wrote
   fleetManager.connectionDetails while the reset goroutine could read it
   concurrently — a data race. Fixed by adding connMu sync.RWMutex to
   FleetConfigManager: writes in refreshAndReconnect are guarded by
   connMu.Lock(); reads in the reset goroutine and runReconnectWorker
   snapshot the value under connMu.RLock() before use.

   Also promotes ConnectCalled, DisconnectCalled, and LastConnectDetails
   on MockMQTTConnection to mutex-guarded accessor methods, eliminating a
   secondary race flagged by go test -race between the goroutine under
   test and the require.Eventually poller in tests.

Made-with: Cursor
@jajeffries jajeffries merged commit 0914e98 into develop Mar 12, 2026
5 checks passed
@jajeffries jajeffries deleted the fix/OBS-2257-stale-jwt branch March 12, 2026 20:27