fix(OBS-2257): token refresh failure causes agent to become stale #287

Merged: jajeffries merged 9 commits into develop from fix/OBS-2257-stale-jwt on Mar 12, 2026

Conversation


@jajeffries jajeffries commented Mar 11, 2026

This pull request introduces concurrency safety improvements to the FleetConfigManager and AuthTokenManager components, ensuring thread-safe access to shared state during agent operation and token refresh workflows. It also updates the mock implementations and associated tests to support concurrent access, preventing data races and improving reliability under multi-goroutine scenarios.

Concurrency safety enhancements:

  • Added sync.RWMutex fields (connMu in FleetConfigManager, mu in AuthTokenManager) to guard shared state, ensuring consistent reads and writes across goroutines. All access to critical fields such as connectionDetails and token-related fields is now protected by appropriate locks. [1] [2]
  • Refactored AuthTokenManager methods (GetToken, RefreshToken, IsTokenExpired, IsTokenExpiringSoon, GetTokenExpiryTime) to use read and write locks, preventing race conditions during concurrent token refresh and expiry checks.

Agent connection and token refresh workflow improvements:

  • Updated the agent reset and reconnect logic in FleetConfigManager.Start to snapshot connection details under a read lock, and to use the latest details during reconnects, ensuring atomicity and consistency during MQTT connection resets.
  • Introduced a robust runReconnectWorker goroutine with exponential backoff and retry logic for token refresh and reconnection, including proper teardown and fast retry signaling after failed attempts. This worker is coordinated via a shared cancellable context for clean shutdown.
  • Ensured that updated connection details are stored under a write lock after token refresh, so all goroutines observe a fully initialized value.

Testing and mock improvements:

  • Enhanced MockMQTTConnection with mutex-protected state and accessor methods (ConnectCalled, DisconnectCalled, LastConnectDetails) to support concurrent test scenarios.
  • Added a new test (TestAuthTokenManager_ConcurrentAccess) to verify thread safety of AuthTokenManager under concurrent reads and writes, leveraging the Go race detector.
  • Updated tests to use the new concurrency-safe mock methods.

These changes collectively eliminate potential data races, improve reliability in concurrent environments, and provide robust test coverage for concurrency scenarios.


References:

…ect worker

When token refresh fails the reconnect goroutine previously logged the
error and immediately waited for the next monitor tick (up to 30s),
leaving the MQTT connection alive with an expired token. The broker
silently rejects heartbeats, causing the server to mark the agent Stale
while the agent process continues running.

Three fixes applied in a new runReconnectWorker method:

1. Exponential backoff retry: up to 5 attempts (5s→10s→20s→40s→2min
   cap) before giving up, respecting ctx cancellation between sleeps.
2. Explicit Disconnect after exhaustion: stops the heartbeat goroutine
   so the server sees the agent as Offline rather than Stale.
3. Fast re-signal via time.AfterFunc: schedules a signal on
   reconnectChan 30s after exhaustion so recovery does not wait for
   the next full monitor tick.

Timing parameters are injected via the method signature so tests can
use millisecond-scale values without slowing the test suite.

Made-with: Cursor
… re-signal

Adds a controlledTokenServer test helper that can be configured to fail
for N requests before returning a valid token, enabling precise control
over the number of retry attempts the worker makes.

Three new test cases cover the three fix scenarios:
- RetriesOnTransientFailure: 2 transient failures then success; asserts
  Disconnect is NOT called and no re-signal is scheduled.
- DisconnectsAfterAllRetriesFail: permanent failure; asserts Disconnect
  is called after all retries are exhausted.
- ReschedulesAfterExhaustion: permanent failure; asserts reconnectChan
  receives a re-signal after the retry delay window.

Also adds DisconnectCalled tracking to MockMQTTConnection to support
the new assertions.

Made-with: Cursor
@github-actions
Copy link

github-actions bot commented Mar 11, 2026

Go test coverage

STATUS ELAPSED PACKAGE COVER PASS FAIL SKIP
🟢 PASS 1.06s github.com/netboxlabs/orb-agent/agent 44.1% 6 0 0
🟢 PASS 31.88s github.com/netboxlabs/orb-agent/agent/backend 76.4% 48 0 0
🟢 PASS 6.03s github.com/netboxlabs/orb-agent/agent/backend/devicediscovery 66.5% 4 0 0
🟢 PASS 0.71s github.com/netboxlabs/orb-agent/agent/backend/mocks 0.0% 0 0 0
🟢 PASS 6.03s github.com/netboxlabs/orb-agent/agent/backend/networkdiscovery 58.3% 4 0 0
🟢 PASS 4.02s github.com/netboxlabs/orb-agent/agent/backend/opentelemetryinfinity 45.2% 2 0 0
🟢 PASS 4.03s github.com/netboxlabs/orb-agent/agent/backend/pktvisor 66.5% 2 0 0
🟢 PASS 6.04s github.com/netboxlabs/orb-agent/agent/backend/snmpdiscovery 58.3% 4 0 0
🟢 PASS 7.03s github.com/netboxlabs/orb-agent/agent/backend/worker 67.8% 7 0 0
🟢 PASS 1.01s github.com/netboxlabs/orb-agent/agent/config 100.0% 6 0 0
🟢 PASS 3.60s github.com/netboxlabs/orb-agent/agent/configmgr 55.3% 39 0 0
🟢 PASS 2.84s github.com/netboxlabs/orb-agent/agent/configmgr/fleet 63.9% 148 0 0
🟢 PASS 1.02s github.com/netboxlabs/orb-agent/agent/otlpbridge 50.0% 10 0 0
🟢 PASS 1.02s github.com/netboxlabs/orb-agent/agent/policies 98.9% 18 0 0
🟢 PASS 1.02s github.com/netboxlabs/orb-agent/agent/policymgr 71.6% 11 0 0
🟢 PASS 1.01s github.com/netboxlabs/orb-agent/agent/redact 81.6% 84 0 0
🟢 PASS 30.69s github.com/netboxlabs/orb-agent/agent/secretsmgr 47.1% 65 0 0
🟢 PASS 1.02s github.com/netboxlabs/orb-agent/agent/telemetry 81.7% 19 0 0
🟢 PASS 1.01s github.com/netboxlabs/orb-agent/agent/version 100.0% 1 0 0

Total coverage: 60.1%

@jajeffries jajeffries self-assigned this Mar 11, 2026
@jajeffries jajeffries marked this pull request as draft March 11, 2026 20:32
Add sync.RWMutex to AuthTokenManager to guard concurrent reads and writes
between the reconnect worker goroutine (writer via GetToken) and the token
expiry monitor goroutine (readers via IsTokenExpired, IsTokenExpiringSoon,
GetTokenExpiryTime). The HTTP call in GetToken is intentionally kept outside
the lock to avoid blocking readers during I/O; all struct field writes are
batched into a single critical section at the end. RefreshToken snapshots
credentials under a read lock before calling GetToken.

Add TestAuthTokenManager_ConcurrentAccess to exercise the race detector:
4 writer and 4 reader goroutines run concurrently for 100ms; the test
fails under go test -race without the mutex.

Made-with: Cursor
Pass monitorCtx to runReconnectWorker so Stop()'s monitorCancel() also
terminates the reconnect goroutine. Replace the for-range loop (which
only exits when the channel is closed) with a for/select on ctx.Done()
and reconnectChan so the worker exits promptly on shutdown.

Made-with: Cursor
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b842c3f184

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

The timer created by time.AfterFunc in runReconnectWorker was discarded,
so its callback could fire after Stop() cancelled monitorCtx — sending on
reconnectChan and logging after the manager was logically shut down.

Guard the callback with a ctx.Err() check: if the context has been
cancelled, the callback returns immediately without side effects.

Made-with: Cursor
The reset goroutine in Start() captured local variables (connectionDetails,
backends, cfg.OrbAgent.Labels, configYaml, topics) by closure at startup.
After a successful JWT refresh, refreshAndReconnect updated
fleetManager.connectionDetails and related fields, but the reset goroutine
continued reconnecting with the stale initial values.

Switch all closure references to fleetManager.* struct fields so the
goroutine always uses the most recently refreshed token and topics.

Also capture ConnectionDetails in MockMQTTConnection.LastConnectDetails
to enable assertion of which details were passed to Connect in tests.

Made-with: Cursor
…ionDetails

- TestFleetConfigManager_ReconnectWorker_AfterFuncSuppressedOnContextCancel:
  exhausts retries, cancels the context before retryDelay elapses, and
  asserts that no re-signal is sent on reconnectChan after shutdown.

- TestFleetConfigManager_ResetGoroutine_UsesLatestConnectionDetails:
  primes the manager with an initial token, simulates a token refresh by
  updating fleetManager.connectionDetails, triggers a reset, and asserts
  that Connect was called with the refreshed token rather than the stale one.

Made-with: Cursor
@jajeffries jajeffries marked this pull request as ready for review March 11, 2026 20:46
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc4d019324


Two P1 issues flagged by the automated Codex review:

1. Bound Disconnect with a timeout in runReconnectWorker
   The Disconnect call previously used the long-lived monitorCtx. If the
   MQTT broker hung during teardown the call could block indefinitely,
   preventing the time.AfterFunc re-signal from ever being scheduled.
   Fixed by creating a context.WithTimeout(context.Background(), timeout)
   for the Disconnect call, consistent with the existing reset goroutine.

2. Synchronise connectionDetails across goroutines
   refreshAndReconnect (reconnect worker goroutine) wrote
   fleetManager.connectionDetails while the reset goroutine could read it
   concurrently — a data race. Fixed by adding connMu sync.RWMutex to
   FleetConfigManager: writes in refreshAndReconnect are guarded by
   connMu.Lock(); reads in the reset goroutine and runReconnectWorker
   snapshot the value under connMu.RLock() before use.

   Also promotes ConnectCalled, DisconnectCalled, and LastConnectDetails
   on MockMQTTConnection to mutex-guarded accessor methods, eliminating a
   secondary race flagged by go test -race between the goroutine under
   test and the require.Eventually poller in tests.

Made-with: Cursor
@jajeffries jajeffries merged commit 0914e98 into develop Mar 12, 2026
5 checks passed
@jajeffries jajeffries deleted the fix/OBS-2257-stale-jwt branch March 12, 2026 20:27