From b151ad4fd30676fea763e0cb66b159a20d1f912b Mon Sep 17 00:00:00 2001 From: Claude Code Date: Sun, 1 Mar 2026 16:29:55 +0200 Subject: [PATCH 1/2] fix(oauth): prevent refresh manager deadloop on server-not-found errors [#310] Fix three bugs causing infinite 0-delay retry loops when an OAuth server is removed from config or becomes unavailable: 1. Integer overflow in calculateBackoff: 1<=30, producing negative then zero durations. Cap the exponent at 25 and guard against non-positive results. 2. "server not found" errors classified as retryable: Add new terminal classification "failed_server_gone" for server-not-found and server-does-not-use-OAuth errors that stop retries immediately. 3. No maximum retry limit: Change DefaultMaxRetries from 0 (unlimited) to 50 as a circuit breaker. Also enforce MinRefreshInterval (5s) as minimum delay floor in rescheduleAfterDelay. Co-Authored-By: Claude Opus 4.6 --- autonomous_summary.md | 71 +++++++++++ internal/oauth/refresh_manager.go | 90 +++++++++++-- internal/oauth/refresh_manager_test.go | 167 +++++++++++++++++++++++++ plan.md | 87 +++++++++++++ spec.md | 100 +++++++++++++++ 5 files changed, 507 insertions(+), 8 deletions(-) create mode 100644 autonomous_summary.md create mode 100644 plan.md create mode 100644 spec.md diff --git a/autonomous_summary.md b/autonomous_summary.md new file mode 100644 index 00000000..6e92b65a --- /dev/null +++ b/autonomous_summary.md @@ -0,0 +1,71 @@ +# Autonomous Summary: Fix OAuth Refresh Deadloop (Issue #310) + +## Problem + +When an OAuth server was removed from configuration or became unavailable, the refresh manager entered an infinite retry loop with 0-second delays, producing 23M+ retries and flooding logs. The only recovery was restarting mcpproxy. + +## Root Cause + +Three interconnected bugs in `internal/oauth/refresh_manager.go`: + +1. **Integer overflow in `calculateBackoff`**: The expression `1 << uint(retryCount)` overflows at retryCount >= 30 on 64-bit systems. At retryCount 30, the shift produces a value that, when multiplied by `RetryBackoffBase` (10s in nanoseconds), overflows `int64` to a negative value. The comparison `backoff > MaxRetryBackoff` fails to catch negative values. At retryCount >= 64, the shift wraps to 0, producing `backoff = 0s` permanently. + +2. **"Server not found" classified as retryable**: The error `"server not found: gcw2"` from `upstream/manager.go` was classified as `"failed_other"` — neither terminal nor network-retryable. This caused it to enter the retry path, which eventually hit the overflow bug. + +3. **No maximum retry limit**: `DefaultMaxRetries = 0` meant unlimited retries, so once the backoff dropped to 0s, retries looped infinitely with no circuit breaker. + +## Changes Made + +### `internal/oauth/refresh_manager.go` + +1. **Fixed `calculateBackoff` overflow** (lines 800-820): + - Added `maxBackoffExponent = 25` constant + - When `retryCount > maxBackoffExponent`, returns `MaxRetryBackoff` immediately + - Added post-computation guard: if `backoff <= 0`, use `MaxRetryBackoff` + +2. **Added terminal error classification** (lines 760-770): + - New classification `"failed_server_gone"` for "server not found" and "server does not use OAuth" errors + - These are checked before network/grant errors since they should never be retried + +3. **Handle terminal errors in `handleRefreshFailure`** (lines 718-733): + - `"failed_server_gone"` errors immediately set `RefreshStateFailed` and emit failure event + - No retry scheduled — the server won't magically reappear + +4. **Changed `DefaultMaxRetries` from 0 to 50** (line 25): + - Circuit breaker: after 50 consecutive failures, stops retrying + - With exponential backoff (10s base, 5min cap), 50 retries spans ~2+ hours + +5. **Check max retries in `handleRefreshFailure`** (lines 735-750): + - When `retryCount >= maxRetries`, sets `RefreshStateFailed` and emits failure event + +6. **Enforced minimum delay in `rescheduleAfterDelay`** (lines 825-828): + - If `delay < MinRefreshInterval` (5s), uses `MinRefreshInterval` + - Defense-in-depth against any code path producing 0 or negative delays + +### `internal/oauth/refresh_manager_test.go` + +Added 6 new test functions: + +- `TestRefreshManager_BackoffOverflowProtection`: Verifies backoff is positive and capped at boundaries (retryCount 30, 63, 64, 100, 1000, 23158728) and exhaustively for 0-10000 +- `TestRefreshManager_ServerNotFoundIsTerminal`: Verifies "server not found" stops refresh immediately with no retry +- `TestRefreshManager_MaxRetryLimit`: Verifies max retry limit stops retries and emits failure event +- `TestDefaultMaxRetries_IsNonZero`: Verifies the constant is non-zero +- `TestRefreshManager_MinimumDelayEnforced`: Verifies 0 and negative delays are clamped to MinRefreshInterval +- Extended `TestClassifyRefreshError` with 3 new cases for server-gone errors + +## Test Results + +``` +go test -race ./internal/oauth/... -v → PASS (17.2s) +go build ./internal/oauth/... → OK +go vet ./internal/oauth/... → OK +``` + +All 28 tests in the oauth package pass, including all pre-existing tests (backward compatible). + +## Verification + +The fix addresses all three failure modes from the issue: +- **Overflow**: `calculateBackoff(23158728)` now returns `5m0s` (was `0s`) +- **Terminal errors**: "server not found" immediately stops (was infinite retry) +- **Circuit breaker**: Even unknown errors stop after 50 retries (was unlimited) diff --git a/internal/oauth/refresh_manager.go b/internal/oauth/refresh_manager.go index a49a0a8c..a782f646 100644 --- a/internal/oauth/refresh_manager.go +++ b/internal/oauth/refresh_manager.go @@ -20,9 +20,10 @@ const ( // 0.75 means refresh at 75% of lifetime for long-lived tokens. DefaultRefreshThreshold = 0.75 - // DefaultMaxRetries is the maximum number of refresh attempts before giving up. - // Set to 0 for unlimited retries until token expiration (FR-009). - DefaultMaxRetries = 0 + // DefaultMaxRetries is the maximum number of consecutive refresh attempts before giving up. + // Acts as a circuit breaker to prevent infinite retry loops (issue #310). + // With exponential backoff (10s base, 5min cap), 50 retries spans ~2+ hours. + DefaultMaxRetries = 50 // MinRefreshInterval prevents too-frequent refresh attempts. MinRefreshInterval = 5 * time.Second @@ -672,7 +673,8 @@ func (m *RefreshManager) handleRefreshSuccess(serverName string) { } // handleRefreshFailure handles a failed token refresh with exponential backoff retry. -// Per FR-009: Retries continue until token expiration (unlimited retries), not a fixed count. +// Terminal errors (invalid_grant, server not found) stop immediately. +// Transient errors retry with exponential backoff up to maxRetries. func (m *RefreshManager) handleRefreshFailure(serverName string, err error) { m.mu.Lock() schedule := m.schedules[serverName] @@ -715,8 +717,48 @@ func (m *RefreshManager) handleRefreshFailure(serverName string, err error) { return } - // Check if we should continue retrying (unlimited retries until token expiration per FR-009) - // Only stop if the access token has completely expired AND no more time remains + // Check if the server is gone (removed from config or not OAuth). + // No amount of retrying will fix this - stop immediately. + if errorType == "failed_server_gone" { + m.logger.Error("OAuth refresh stopped - server no longer available", + zap.String("server", serverName), + zap.Error(err)) + + m.mu.Lock() + if schedule := m.schedules[serverName]; schedule != nil { + schedule.RefreshState = RefreshStateFailed + schedule.LastError = err.Error() + } + m.mu.Unlock() + + if m.eventEmitter != nil { + m.eventEmitter.EmitOAuthRefreshFailed(serverName, err.Error()) + } + return + } + + // Check if max retries exceeded (circuit breaker to prevent infinite loops). + if m.maxRetries > 0 && retryCount >= m.maxRetries { + m.logger.Error("OAuth token refresh failed - max retries exceeded", + zap.String("server", serverName), + zap.Int("max_retries", m.maxRetries), + zap.Int("retry_count", retryCount)) + + m.mu.Lock() + if schedule := m.schedules[serverName]; schedule != nil { + schedule.RefreshState = RefreshStateFailed + schedule.LastError = "Max retries exceeded - re-authentication required" + } + m.mu.Unlock() + + if m.eventEmitter != nil { + m.eventEmitter.EmitOAuthRefreshFailed(serverName, err.Error()) + } + return + } + + // Check if we should continue retrying based on token expiration. + // Only stop if the access token has completely expired AND no more time remains. now := time.Now() if !expiresAt.IsZero() && now.After(expiresAt) { // Token has already expired - check if we should give up @@ -748,7 +790,7 @@ func (m *RefreshManager) handleRefreshFailure(serverName string, err error) { } // classifyRefreshError categorizes a refresh error for metrics and error handling. -// Returns one of: "failed_network", "failed_invalid_grant", "failed_other". +// Returns one of: "failed_network", "failed_invalid_grant", "failed_server_gone", "failed_other". func classifyRefreshError(err error) string { if err == nil { return "success" @@ -756,6 +798,18 @@ func classifyRefreshError(err error) string { errStr := err.Error() + // Check for terminal server-gone errors (server removed from config or not OAuth). + // These should never be retried because the server no longer exists or doesn't use OAuth. + serverGoneErrors := []string{ + "server not found", + "server does not use OAuth", + } + for _, pattern := range serverGoneErrors { + if stringutil.ContainsIgnoreCase(errStr, pattern) { + return "failed_server_gone" + } + } + // Check for permanent OAuth errors (refresh token invalid/expired) permanentErrors := []string{ "invalid_grant", @@ -789,15 +843,29 @@ func classifyRefreshError(err error) string { return "failed_other" } +// maxBackoffExponent is the maximum shift exponent that won't overflow when +// multiplied by RetryBackoffBase (10s = 10_000_000_000 ns). On 64-bit systems, +// 1<<30 * 10e9 overflows int64. We cap at 25 which gives 10s * 2^25 = 335,544,320s +// — well above MaxRetryBackoff (300s), so the cap applies naturally. +const maxBackoffExponent = 25 + // calculateBackoff calculates the exponential backoff duration for a given retry count. // The formula is: base * 2^retryCount, capped at MaxRetryBackoff (5 minutes). // Sequence: 10s → 20s → 40s → 80s → 160s → 300s (cap). +// +// The exponent is capped at maxBackoffExponent to prevent integer overflow +// that caused issue #310 (0s delay at high retry counts leading to infinite loops). func (m *RefreshManager) calculateBackoff(retryCount int) time.Duration { if retryCount < 0 { retryCount = 0 } + // Cap the exponent to prevent integer overflow. + // Beyond this exponent, the result would exceed MaxRetryBackoff anyway. + if retryCount > maxBackoffExponent { + return MaxRetryBackoff + } backoff := RetryBackoffBase * time.Duration(1< MaxRetryBackoff { + if backoff <= 0 || backoff > MaxRetryBackoff { backoff = MaxRetryBackoff } return backoff @@ -814,7 +882,13 @@ func (m *RefreshManager) isRateLimited(schedule *RefreshSchedule) bool { } // rescheduleAfterDelay reschedules a refresh attempt after a delay. +// The delay is enforced to be at least MinRefreshInterval to prevent tight loops. func (m *RefreshManager) rescheduleAfterDelay(serverName string, delay time.Duration) { + // Enforce minimum delay to prevent tight retry loops (defense-in-depth for issue #310). + if delay < MinRefreshInterval { + delay = MinRefreshInterval + } + m.mu.Lock() defer m.mu.Unlock() diff --git a/internal/oauth/refresh_manager_test.go b/internal/oauth/refresh_manager_test.go index c79fbd8d..fef869ee 100644 --- a/internal/oauth/refresh_manager_test.go +++ b/internal/oauth/refresh_manager_test.go @@ -647,6 +647,10 @@ func TestClassifyRefreshError(t *testing.T) { {"context deadline exceeded", errors.New("context deadline exceeded"), "failed_network"}, {"generic error", errors.New("unknown server error"), "failed_other"}, {"server_error", errors.New("OAuth server_error"), "failed_other"}, + // Terminal errors: server gone + {"server not found", errors.New("server not found: gcw2"), "failed_server_gone"}, + {"server not found wrapped", errors.New("failed to refresh OAuth token: server not found: myserver"), "failed_server_gone"}, + {"server does not use OAuth", errors.New("server does not use OAuth: gcw2"), "failed_server_gone"}, } for _, tt := range tests { @@ -656,3 +660,166 @@ func TestClassifyRefreshError(t *testing.T) { }) } } + +// Test that calculateBackoff never returns zero or negative for any retry count (overflow protection). +func TestRefreshManager_BackoffOverflowProtection(t *testing.T) { + logger := zaptest.NewLogger(t) + manager := NewRefreshManager(nil, nil, nil, logger) + + // Specific boundary cases that triggered the original bug + boundaryTests := []struct { + retryCount int + desc string + }{ + {30, "overflow boundary on 64-bit (10s * 2^30 overflows nanoseconds)"}, + {63, "1<<63 is min int64, duration becomes 0 or negative"}, + {64, "1<<64 wraps to 0 on 64-bit"}, + {100, "well beyond overflow"}, + {1000, "extreme retry count"}, + {23158728, "actual retry count from issue #310"}, + } + + for _, tt := range boundaryTests { + t.Run(tt.desc, func(t *testing.T) { + backoff := manager.calculateBackoff(tt.retryCount) + assert.Greater(t, backoff, time.Duration(0), + "Backoff must be positive at retryCount=%d", tt.retryCount) + assert.LessOrEqual(t, backoff, MaxRetryBackoff, + "Backoff must not exceed MaxRetryBackoff at retryCount=%d", tt.retryCount) + }) + } + + // Exhaustive check: no retry count from 0 to 10000 produces zero or negative backoff + for i := 0; i <= 10000; i++ { + backoff := manager.calculateBackoff(i) + if backoff <= 0 || backoff > MaxRetryBackoff { + t.Fatalf("calculateBackoff(%d) = %v, want positive and <= %v", i, backoff, MaxRetryBackoff) + } + } +} + +// Test that "server not found" is treated as terminal (no retry). +func TestRefreshManager_ServerNotFoundIsTerminal(t *testing.T) { + logger := zaptest.NewLogger(t) + store := newMockTokenStore() + emitter := &mockEventEmitter{} + + manager := NewRefreshManager(store, nil, nil, logger) + runtime := &mockRuntime{ + refreshErr: errors.New("failed to refresh OAuth token: server not found: gcw2"), + } + manager.SetRuntime(runtime) + manager.SetEventEmitter(emitter) + + ctx := context.Background() + err := manager.Start(ctx) + require.NoError(t, err) + defer manager.Stop() + + // Create a schedule and trigger refresh + expiresAt := time.Now().Add(1 * time.Hour) + manager.OnTokenSaved("gcw2", expiresAt) + time.Sleep(50 * time.Millisecond) + + // Execute refresh which will fail with "server not found" + manager.executeRefresh("gcw2") + time.Sleep(50 * time.Millisecond) + + // Should be in failed state (not retrying) + schedule := manager.GetSchedule("gcw2") + require.NotNil(t, schedule) + assert.Equal(t, RefreshStateFailed, schedule.RefreshState, + "Server not found should immediately fail, not retry") + assert.Equal(t, 1, schedule.RetryCount, "Should have exactly 1 attempt") + + // Failure event should be emitted + assert.Equal(t, 1, emitter.GetFailedEvents(), + "Should emit failure event for terminal server-not-found error") +} + +// Test that max retry limit stops retries. +func TestRefreshManager_MaxRetryLimit(t *testing.T) { + logger := zaptest.NewLogger(t) + store := newMockTokenStore() + emitter := &mockEventEmitter{} + + config := &RefreshManagerConfig{ + Threshold: 0.1, + MaxRetries: 5, // Low limit for testing + } + + manager := NewRefreshManager(store, nil, config, logger) + runtime := &mockRuntime{ + refreshErr: errors.New("some transient error"), + } + manager.SetRuntime(runtime) + manager.SetEventEmitter(emitter) + + ctx := context.Background() + err := manager.Start(ctx) + require.NoError(t, err) + defer manager.Stop() + + // Create a schedule + expiresAt := time.Now().Add(1 * time.Hour) + manager.OnTokenSaved("test-server", expiresAt) + time.Sleep(50 * time.Millisecond) + + // Simulate failures beyond max retries + for i := 0; i < 7; i++ { + manager.executeRefresh("test-server") + time.Sleep(10 * time.Millisecond) + } + + // After exceeding max retries, should be in failed state + schedule := manager.GetSchedule("test-server") + require.NotNil(t, schedule) + assert.Equal(t, RefreshStateFailed, schedule.RefreshState, + "Should transition to failed after max retries exceeded") + + // Failure event should be emitted + assert.GreaterOrEqual(t, emitter.GetFailedEvents(), 1, + "Should emit failure event when max retries exceeded") +} + +// Test that DefaultMaxRetries is now non-zero. +func TestDefaultMaxRetries_IsNonZero(t *testing.T) { + assert.Greater(t, DefaultMaxRetries, 0, + "DefaultMaxRetries must be non-zero to prevent infinite retry loops") +} + +// Test that rescheduleAfterDelay enforces minimum delay. +func TestRefreshManager_MinimumDelayEnforced(t *testing.T) { + logger := zaptest.NewLogger(t) + store := newMockTokenStore() + + manager := NewRefreshManager(store, nil, nil, logger) + runtime := &mockRuntime{} + manager.SetRuntime(runtime) + + ctx := context.Background() + err := manager.Start(ctx) + require.NoError(t, err) + defer manager.Stop() + + // Create a schedule first + expiresAt := time.Now().Add(1 * time.Hour) + manager.OnTokenSaved("test-server", expiresAt) + time.Sleep(50 * time.Millisecond) + + // Call rescheduleAfterDelay with zero delay + manager.rescheduleAfterDelay("test-server", 0) + + schedule := manager.GetSchedule("test-server") + require.NotNil(t, schedule) + assert.GreaterOrEqual(t, schedule.RetryBackoff, MinRefreshInterval, + "Delay should be at least MinRefreshInterval even when 0 is passed") + + // Call rescheduleAfterDelay with negative delay + manager.rescheduleAfterDelay("test-server", -5*time.Second) + + schedule = manager.GetSchedule("test-server") + require.NotNil(t, schedule) + assert.GreaterOrEqual(t, schedule.RetryBackoff, MinRefreshInterval, + "Delay should be at least MinRefreshInterval even when negative is passed") +} diff --git a/plan.md b/plan.md new file mode 100644 index 00000000..9e66fd53 --- /dev/null +++ b/plan.md @@ -0,0 +1,87 @@ +# Implementation Plan: Fix OAuth Refresh Deadloop (Issue #310) + +## Overview + +Four changes to `internal/oauth/refresh_manager.go` and updates to `internal/oauth/refresh_manager_test.go`, implemented in TDD order. + +## Step 1: Fix `calculateBackoff` Integer Overflow + +### Test First +Add `TestRefreshManager_BackoffOverflowProtection` to verify: +- `calculateBackoff(30)` returns `MaxRetryBackoff` (not negative/zero) +- `calculateBackoff(63)` returns `MaxRetryBackoff` (not zero) +- `calculateBackoff(64)` returns `MaxRetryBackoff` (not zero) +- `calculateBackoff(1000)` returns `MaxRetryBackoff` (not zero) +- Loop 0..10000 verifying all results > 0 and <= MaxRetryBackoff + +### Implementation +In `calculateBackoff`: +- Cap `retryCount` to a safe maximum exponent (e.g., 30) to prevent overflow +- Add post-computation guard: if `backoff <= 0`, set to `MaxRetryBackoff` + +## Step 2: Add Terminal Error Classification for "Server Not Found" + +### Test First +Add test cases to `TestClassifyRefreshError`: +- `"server not found: gcw2"` -> `"failed_server_gone"` +- `"server does not use OAuth: gcw2"` -> `"failed_server_gone"` + +Add `TestRefreshManager_ServerNotFoundIsTerminal` to verify: +- `handleRefreshFailure` with "server not found" error stops immediately +- Schedule transitions to `RefreshStateFailed` +- Failure event is emitted + +### Implementation +In `classifyRefreshError`: +- Add new terminal error patterns: `"server not found"`, `"server does not use OAuth"` +- Return `"failed_server_gone"` for these patterns + +In `handleRefreshFailure`: +- Handle `"failed_server_gone"` the same as `"failed_invalid_grant"` (immediate stop, no retry) + +## Step 3: Add Maximum Retry Limit + +### Test First +Add `TestRefreshManager_MaxRetryLimit` to verify: +- After `DefaultMaxRetries` (50) consecutive failures, schedule transitions to `RefreshStateFailed` +- Failure event is emitted when limit is reached + +### Implementation +- Change `DefaultMaxRetries` from `0` to `50` +- In `handleRefreshFailure`, after classifying the error and before calculating backoff: + - Check if `retryCount >= m.maxRetries` (when `m.maxRetries > 0`) + - If exceeded, log error, set state to `RefreshStateFailed`, emit event, return + +Note: Update `NewRefreshManager` to use the new default -- currently `if config.MaxRetries > 0` would override, so 0 still means "use default of 50". + +## Step 4: Enforce Minimum Delay Floor in `rescheduleAfterDelay` + +### Test First +Add `TestRefreshManager_MinimumDelayEnforced` to verify: +- `rescheduleAfterDelay` with 0 delay uses `MinRefreshInterval` instead +- `rescheduleAfterDelay` with negative delay uses `MinRefreshInterval` instead + +### Implementation +In `rescheduleAfterDelay`: +- Add check: `if delay < MinRefreshInterval { delay = MinRefreshInterval }` + +## Step 5: Verify and Commit + +1. Run `go test -race ./internal/oauth/... -v` +2. Run `go build ./...` +3. Write `autonomous_summary.md` +4. Commit all changes on branch `fix/oauth-refresh-deadloop-310` + +## Files Modified + +| File | Changes | +|------|---------| +| `internal/oauth/refresh_manager.go` | Fix `calculateBackoff`, add terminal error handling in `handleRefreshFailure`, enforce min delay in `rescheduleAfterDelay`, change `DefaultMaxRetries` | +| `internal/oauth/refresh_manager_test.go` | Add 5+ new test functions covering overflow, terminal errors, max retries, min delay | + +## Risk Assessment + +- **Low risk**: All changes are defensive additions (caps, guards, new classifications) +- **No behavioral change for normal operations**: Backoff sequence 10s->20s->40s->80s->160s->300s unchanged for retryCount 0-5 +- **No API changes**: No public interface modifications +- **Backward compatible**: Existing configs continue to work diff --git a/spec.md b/spec.md new file mode 100644 index 00000000..14feba01 --- /dev/null +++ b/spec.md @@ -0,0 +1,100 @@ +# Spec: Fix OAuth Refresh Deadloop (Issue #310) + +## Problem Statement + +When an OAuth server is removed from configuration or becomes unavailable, the refresh manager enters an infinite retry loop with 0-second delays, producing millions of retries per minute and spamming logs. The only recovery is restarting mcpproxy. + +### Root Cause Analysis + +There are **three distinct bugs** contributing to this issue: + +#### Bug 1: Integer Overflow in Backoff Calculation + +The `calculateBackoff` function uses `1 << uint(retryCount)` which overflows on 64-bit systems: + +- At `retryCount >= 30`: `1 << 30 = 1073741824`, multiplied by `10s` base overflows `time.Duration` (int64 nanoseconds) to a **negative value**. The `backoff > MaxRetryBackoff` check does NOT catch negative values, so the uncapped negative duration is used. +- At `retryCount >= 63`: `1 << 63` overflows to the minimum int64 value, and `1 << 64 = 0`, giving `backoff = 0s`. +- At `retryCount >= 64`: All further shifts produce `0`, giving `backoff = 0s` permanently. + +This means after ~30 retries (about 5 minutes of exponential backoff), the delay drops to 0 and retries become infinite tight loops. + +#### Bug 2: "Server Not Found" Classified as Retryable + +The error `"server not found: gcw2"` from `upstream/manager.go` is classified as `"failed_other"` by `classifyRefreshError`. This classification does NOT trigger the permanent failure path (only `"failed_invalid_grant"` does). Instead, it enters the retry path with backoff, which then hits Bug 1. + +When a server is removed from configuration, "server not found" is a **terminal** condition -- no amount of retrying will make the server appear. This should be treated as a permanent failure. + +#### Bug 3: No Maximum Retry Limit + +The constant `DefaultMaxRetries = 0` means "unlimited retries." Combined with Bug 1 causing 0s delays, this allows the retry count to reach 23M+ without any circuit breaker stopping it. + +## Solution + +### Fix 1: Overflow-Safe Backoff Calculation + +Cap the shift exponent to prevent integer overflow. The maximum useful exponent before hitting `MaxRetryBackoff` (5 minutes = 300s) with `RetryBackoffBase` (10s) is 5 (`10s * 2^5 = 320s > 300s`). Cap the exponent at a safe value (e.g., 30) that prevents overflow while still allowing the backoff cap to apply naturally. + +Additionally, add a guard: if the computed backoff is zero or negative, use `MaxRetryBackoff` as a fallback. + +### Fix 2: Classify "Server Not Found" as Terminal Error + +Add `"server not found"` to the list of terminal (permanent) errors in `classifyRefreshError`. When a server is not in the configuration, retrying is pointless. The same applies to `"server does not use OAuth"`. + +Create a new error classification category: `"failed_server_gone"` for errors that indicate the server no longer exists or is not applicable for OAuth refresh. + +### Fix 3: Add Maximum Consecutive Retry Limit + +Change `DefaultMaxRetries` from `0` (unlimited) to `50`. After 50 consecutive failures, mark the refresh as permanently failed and stop retrying. This acts as a circuit breaker even if Fixes 1 and 2 don't cover an edge case. + +The retry limit is per-schedule and resets on success (via `handleRefreshSuccess` which sets `RetryCount = 0`). + +### Fix 4: Enforce Minimum Delay Floor + +In `rescheduleAfterDelay`, enforce that the delay is never less than `MinRefreshInterval` (5 seconds). This provides defense-in-depth against any code path that computes a 0 or negative delay. + +## Assumptions + +1. **50 max retries is sufficient**: With exponential backoff (10s, 20s, 40s, ... 300s cap), 50 retries spans approximately 2+ hours. If a transient issue isn't resolved in 2 hours, it's likely permanent. Users can re-trigger refresh via re-login. + +2. **"Server not found" is always terminal**: If a server was removed from config while having stored OAuth tokens, the refresh should stop immediately. The stale token record will be cleaned up separately (or via `OnTokenCleared`). + +3. **"Server does not use OAuth" is terminal**: Similar to "server not found" -- if the server config changed to remove OAuth, retrying is pointless. + +4. **The backoff exponent cap of 30 is safe**: `1 << 30` on a 64-bit platform is `1073741824`, which when multiplied by 10s (`10_000_000_000 ns`) gives `10_737_418_240_000_000_000 ns` -- still within int64 range but will be capped by `MaxRetryBackoff` anyway. We clamp before this point for defense-in-depth. + +5. **No configuration change needed**: The max retry limit is a code-level constant, not a user-facing config. This keeps the fix simple and secure by default. + +## Functional Requirements + +| ID | Requirement | +|----|-------------| +| FR-001 | `calculateBackoff` MUST NOT produce zero or negative durations for any retry count value | +| FR-002 | `classifyRefreshError` MUST classify "server not found" errors as terminal (non-retryable) | +| FR-003 | `classifyRefreshError` MUST classify "server does not use OAuth" errors as terminal | +| FR-004 | `handleRefreshFailure` MUST stop retrying after `DefaultMaxRetries` consecutive failures | +| FR-005 | `rescheduleAfterDelay` MUST enforce a minimum delay of `MinRefreshInterval` (5 seconds) | +| FR-006 | When max retries are exceeded, the schedule MUST transition to `RefreshStateFailed` | +| FR-007 | When a terminal error is detected, the schedule MUST transition to `RefreshStateFailed` immediately | +| FR-008 | A failure event MUST be emitted when retries are exhausted or a terminal error occurs | + +## Non-Functional Requirements + +| ID | Requirement | +|----|-------------| +| NFR-001 | All changes must pass `go test -race ./internal/oauth/...` | +| NFR-002 | All changes must compile with `go build ./...` | +| NFR-003 | No breaking changes to public APIs or configuration | +| NFR-004 | Backward-compatible: existing exponential backoff behavior preserved for normal retry ranges | + +## Testing Requirements + +| Test | Description | +|------|-------------| +| T001 | `calculateBackoff` returns correct values at overflow boundaries (retryCount=30, 63, 64, 100, 1000) | +| T002 | `calculateBackoff` never returns zero or negative for any retryCount 0-10000 | +| T003 | `classifyRefreshError` classifies "server not found" as terminal | +| T004 | `classifyRefreshError` classifies "server does not use OAuth" as terminal | +| T005 | `handleRefreshFailure` stops after max retries and emits failure event | +| T006 | `handleRefreshFailure` treats terminal errors same as invalid_grant (immediate stop) | +| T007 | `rescheduleAfterDelay` enforces minimum delay | +| T008 | Integration: server-not-found error stops refresh immediately (no retry) | From 8aebb8760cc6de9b4b3eba58c294a35de1b95834 Mon Sep 17 00:00:00 2001 From: Claude Code Date: Sun, 1 Mar 2026 17:00:52 +0200 Subject: [PATCH 2/2] chore: remove spec/plan/summary artifacts from PR Co-Authored-By: Claude Opus 4.6 --- autonomous_summary.md | 71 ------------------------------ plan.md | 87 ------------------------------------ spec.md | 100 ------------------------------------------ 3 files changed, 258 deletions(-) delete mode 100644 autonomous_summary.md delete mode 100644 plan.md delete mode 100644 spec.md diff --git a/autonomous_summary.md b/autonomous_summary.md deleted file mode 100644 index 6e92b65a..00000000 --- a/autonomous_summary.md +++ /dev/null @@ -1,71 +0,0 @@ -# Autonomous Summary: Fix OAuth Refresh Deadloop (Issue #310) - -## Problem - -When an OAuth server was removed from configuration or became unavailable, the refresh manager entered an infinite retry loop with 0-second delays, producing 23M+ retries and flooding logs. The only recovery was restarting mcpproxy. - -## Root Cause - -Three interconnected bugs in `internal/oauth/refresh_manager.go`: - -1. **Integer overflow in `calculateBackoff`**: The expression `1 << uint(retryCount)` overflows at retryCount >= 30 on 64-bit systems. At retryCount 30, the shift produces a value that, when multiplied by `RetryBackoffBase` (10s in nanoseconds), overflows `int64` to a negative value. The comparison `backoff > MaxRetryBackoff` fails to catch negative values. At retryCount >= 64, the shift wraps to 0, producing `backoff = 0s` permanently. - -2. **"Server not found" classified as retryable**: The error `"server not found: gcw2"` from `upstream/manager.go` was classified as `"failed_other"` — neither terminal nor network-retryable. This caused it to enter the retry path, which eventually hit the overflow bug. - -3. **No maximum retry limit**: `DefaultMaxRetries = 0` meant unlimited retries, so once the backoff dropped to 0s, retries looped infinitely with no circuit breaker. - -## Changes Made - -### `internal/oauth/refresh_manager.go` - -1. **Fixed `calculateBackoff` overflow** (lines 800-820): - - Added `maxBackoffExponent = 25` constant - - When `retryCount > maxBackoffExponent`, returns `MaxRetryBackoff` immediately - - Added post-computation guard: if `backoff <= 0`, use `MaxRetryBackoff` - -2. **Added terminal error classification** (lines 760-770): - - New classification `"failed_server_gone"` for "server not found" and "server does not use OAuth" errors - - These are checked before network/grant errors since they should never be retried - -3. **Handle terminal errors in `handleRefreshFailure`** (lines 718-733): - - `"failed_server_gone"` errors immediately set `RefreshStateFailed` and emit failure event - - No retry scheduled — the server won't magically reappear - -4. **Changed `DefaultMaxRetries` from 0 to 50** (line 25): - - Circuit breaker: after 50 consecutive failures, stops retrying - - With exponential backoff (10s base, 5min cap), 50 retries spans ~2+ hours - -5. **Check max retries in `handleRefreshFailure`** (lines 735-750): - - When `retryCount >= maxRetries`, sets `RefreshStateFailed` and emits failure event - -6. **Enforced minimum delay in `rescheduleAfterDelay`** (lines 825-828): - - If `delay < MinRefreshInterval` (5s), uses `MinRefreshInterval` - - Defense-in-depth against any code path producing 0 or negative delays - -### `internal/oauth/refresh_manager_test.go` - -Added 6 new test functions: - -- `TestRefreshManager_BackoffOverflowProtection`: Verifies backoff is positive and capped at boundaries (retryCount 30, 63, 64, 100, 1000, 23158728) and exhaustively for 0-10000 -- `TestRefreshManager_ServerNotFoundIsTerminal`: Verifies "server not found" stops refresh immediately with no retry -- `TestRefreshManager_MaxRetryLimit`: Verifies max retry limit stops retries and emits failure event -- `TestDefaultMaxRetries_IsNonZero`: Verifies the constant is non-zero -- `TestRefreshManager_MinimumDelayEnforced`: Verifies 0 and negative delays are clamped to MinRefreshInterval -- Extended `TestClassifyRefreshError` with 3 new cases for server-gone errors - -## Test Results - -``` -go test -race ./internal/oauth/... -v → PASS (17.2s) -go build ./internal/oauth/... → OK -go vet ./internal/oauth/... → OK -``` - -All 28 tests in the oauth package pass, including all pre-existing tests (backward compatible). - -## Verification - -The fix addresses all three failure modes from the issue: -- **Overflow**: `calculateBackoff(23158728)` now returns `5m0s` (was `0s`) -- **Terminal errors**: "server not found" immediately stops (was infinite retry) -- **Circuit breaker**: Even unknown errors stop after 50 retries (was unlimited) diff --git a/plan.md b/plan.md deleted file mode 100644 index 9e66fd53..00000000 --- a/plan.md +++ /dev/null @@ -1,87 +0,0 @@ -# Implementation Plan: Fix OAuth Refresh Deadloop (Issue #310) - -## Overview - -Four changes to `internal/oauth/refresh_manager.go` and updates to `internal/oauth/refresh_manager_test.go`, implemented in TDD order. - -## Step 1: Fix `calculateBackoff` Integer Overflow - -### Test First -Add `TestRefreshManager_BackoffOverflowProtection` to verify: -- `calculateBackoff(30)` returns `MaxRetryBackoff` (not negative/zero) -- `calculateBackoff(63)` returns `MaxRetryBackoff` (not zero) -- `calculateBackoff(64)` returns `MaxRetryBackoff` (not zero) -- `calculateBackoff(1000)` returns `MaxRetryBackoff` (not zero) -- Loop 0..10000 verifying all results > 0 and <= MaxRetryBackoff - -### Implementation -In `calculateBackoff`: -- Cap `retryCount` to a safe maximum exponent (e.g., 30) to prevent overflow -- Add post-computation guard: if `backoff <= 0`, set to `MaxRetryBackoff` - -## Step 2: Add Terminal Error Classification for "Server Not Found" - -### Test First -Add test cases to `TestClassifyRefreshError`: -- `"server not found: gcw2"` -> `"failed_server_gone"` -- `"server does not use OAuth: gcw2"` -> `"failed_server_gone"` - -Add `TestRefreshManager_ServerNotFoundIsTerminal` to verify: -- `handleRefreshFailure` with "server not found" error stops immediately -- Schedule transitions to `RefreshStateFailed` -- Failure event is emitted - -### Implementation -In `classifyRefreshError`: -- Add new terminal error patterns: `"server not found"`, `"server does not use OAuth"` -- Return `"failed_server_gone"` for these patterns - -In `handleRefreshFailure`: -- Handle `"failed_server_gone"` the same as `"failed_invalid_grant"` (immediate stop, no retry) - -## Step 3: Add Maximum Retry Limit - -### Test First -Add `TestRefreshManager_MaxRetryLimit` to verify: -- After `DefaultMaxRetries` (50) consecutive failures, schedule transitions to `RefreshStateFailed` -- Failure event is emitted when limit is reached - -### Implementation -- Change `DefaultMaxRetries` from `0` to `50` -- In `handleRefreshFailure`, after classifying the error and before calculating backoff: - - Check if `retryCount >= m.maxRetries` (when `m.maxRetries > 0`) - - If exceeded, log error, set state to `RefreshStateFailed`, emit event, return - -Note: Update `NewRefreshManager` to use the new default -- currently `if config.MaxRetries > 0` would override, so 0 still means "use default of 50". - -## Step 4: Enforce Minimum Delay Floor in `rescheduleAfterDelay` - -### Test First -Add `TestRefreshManager_MinimumDelayEnforced` to verify: -- `rescheduleAfterDelay` with 0 delay uses `MinRefreshInterval` instead -- `rescheduleAfterDelay` with negative delay uses `MinRefreshInterval` instead - -### Implementation -In `rescheduleAfterDelay`: -- Add check: `if delay < MinRefreshInterval { delay = MinRefreshInterval }` - -## Step 5: Verify and Commit - -1. Run `go test -race ./internal/oauth/... -v` -2. Run `go build ./...` -3. Write `autonomous_summary.md` -4. Commit all changes on branch `fix/oauth-refresh-deadloop-310` - -## Files Modified - -| File | Changes | -|------|---------| -| `internal/oauth/refresh_manager.go` | Fix `calculateBackoff`, add terminal error handling in `handleRefreshFailure`, enforce min delay in `rescheduleAfterDelay`, change `DefaultMaxRetries` | -| `internal/oauth/refresh_manager_test.go` | Add 5+ new test functions covering overflow, terminal errors, max retries, min delay | - -## Risk Assessment - -- **Low risk**: All changes are defensive additions (caps, guards, new classifications) -- **No behavioral change for normal operations**: Backoff sequence 10s->20s->40s->80s->160s->300s unchanged for retryCount 0-5 -- **No API changes**: No public interface modifications -- **Backward compatible**: Existing configs continue to work diff --git a/spec.md b/spec.md deleted file mode 100644 index 14feba01..00000000 --- a/spec.md +++ /dev/null @@ -1,100 +0,0 @@ -# Spec: Fix OAuth Refresh Deadloop (Issue #310) - -## Problem Statement - -When an OAuth server is removed from configuration or becomes unavailable, the refresh manager enters an infinite retry loop with 0-second delays, producing millions of retries per minute and spamming logs. The only recovery is restarting mcpproxy. - -### Root Cause Analysis - -There are **three distinct bugs** contributing to this issue: - -#### Bug 1: Integer Overflow in Backoff Calculation - -The `calculateBackoff` function uses `1 << uint(retryCount)` which overflows on 64-bit systems: - -- At `retryCount >= 30`: `1 << 30 = 1073741824`, multiplied by `10s` base overflows `time.Duration` (int64 nanoseconds) to a **negative value**. The `backoff > MaxRetryBackoff` check does NOT catch negative values, so the uncapped negative duration is used. -- At `retryCount >= 63`: `1 << 63` overflows to the minimum int64 value, and `1 << 64 = 0`, giving `backoff = 0s`. -- At `retryCount >= 64`: All further shifts produce `0`, giving `backoff = 0s` permanently. - -This means after ~30 retries (about 5 minutes of exponential backoff), the delay drops to 0 and retries become infinite tight loops. - -#### Bug 2: "Server Not Found" Classified as Retryable - -The error `"server not found: gcw2"` from `upstream/manager.go` is classified as `"failed_other"` by `classifyRefreshError`. This classification does NOT trigger the permanent failure path (only `"failed_invalid_grant"` does). Instead, it enters the retry path with backoff, which then hits Bug 1. - -When a server is removed from configuration, "server not found" is a **terminal** condition -- no amount of retrying will make the server appear. This should be treated as a permanent failure. - -#### Bug 3: No Maximum Retry Limit - -The constant `DefaultMaxRetries = 0` means "unlimited retries." Combined with Bug 1 causing 0s delays, this allows the retry count to reach 23M+ without any circuit breaker stopping it. - -## Solution - -### Fix 1: Overflow-Safe Backoff Calculation - -Cap the shift exponent to prevent integer overflow. The maximum useful exponent before hitting `MaxRetryBackoff` (5 minutes = 300s) with `RetryBackoffBase` (10s) is 5 (`10s * 2^5 = 320s > 300s`). Cap the exponent at a safe value (e.g., 30) that prevents overflow while still allowing the backoff cap to apply naturally. - -Additionally, add a guard: if the computed backoff is zero or negative, use `MaxRetryBackoff` as a fallback. - -### Fix 2: Classify "Server Not Found" as Terminal Error - -Add `"server not found"` to the list of terminal (permanent) errors in `classifyRefreshError`. When a server is not in the configuration, retrying is pointless. The same applies to `"server does not use OAuth"`. - -Create a new error classification category: `"failed_server_gone"` for errors that indicate the server no longer exists or is not applicable for OAuth refresh. - -### Fix 3: Add Maximum Consecutive Retry Limit - -Change `DefaultMaxRetries` from `0` (unlimited) to `50`. After 50 consecutive failures, mark the refresh as permanently failed and stop retrying. This acts as a circuit breaker even if Fixes 1 and 2 don't cover an edge case. - -The retry limit is per-schedule and resets on success (via `handleRefreshSuccess` which sets `RetryCount = 0`). - -### Fix 4: Enforce Minimum Delay Floor - -In `rescheduleAfterDelay`, enforce that the delay is never less than `MinRefreshInterval` (5 seconds). This provides defense-in-depth against any code path that computes a 0 or negative delay. - -## Assumptions - -1. **50 max retries is sufficient**: With exponential backoff (10s, 20s, 40s, ... 300s cap), 50 retries spans approximately 2+ hours. If a transient issue isn't resolved in 2 hours, it's likely permanent. Users can re-trigger refresh via re-login. - -2. **"Server not found" is always terminal**: If a server was removed from config while having stored OAuth tokens, the refresh should stop immediately. The stale token record will be cleaned up separately (or via `OnTokenCleared`). - -3. **"Server does not use OAuth" is terminal**: Similar to "server not found" -- if the server config changed to remove OAuth, retrying is pointless. - -4. **The backoff exponent cap of 30 is safe**: `1 << 30` on a 64-bit platform is `1073741824`, which when multiplied by 10s (`10_000_000_000 ns`) gives `10_737_418_240_000_000_000 ns` -- still within int64 range but will be capped by `MaxRetryBackoff` anyway. We clamp before this point for defense-in-depth. - -5. **No configuration change needed**: The max retry limit is a code-level constant, not a user-facing config. This keeps the fix simple and secure by default. - -## Functional Requirements - -| ID | Requirement | -|----|-------------| -| FR-001 | `calculateBackoff` MUST NOT produce zero or negative durations for any retry count value | -| FR-002 | `classifyRefreshError` MUST classify "server not found" errors as terminal (non-retryable) | -| FR-003 | `classifyRefreshError` MUST classify "server does not use OAuth" errors as terminal | -| FR-004 | `handleRefreshFailure` MUST stop retrying after `DefaultMaxRetries` consecutive failures | -| FR-005 | `rescheduleAfterDelay` MUST enforce a minimum delay of `MinRefreshInterval` (5 seconds) | -| FR-006 | When max retries are exceeded, the schedule MUST transition to `RefreshStateFailed` | -| FR-007 | When a terminal error is detected, the schedule MUST transition to `RefreshStateFailed` immediately | -| FR-008 | A failure event MUST be emitted when retries are exhausted or a terminal error occurs | - -## Non-Functional Requirements - -| ID | Requirement | -|----|-------------| -| NFR-001 | All changes must pass `go test -race ./internal/oauth/...` | -| NFR-002 | All changes must compile with `go build ./...` | -| NFR-003 | No breaking changes to public APIs or configuration | -| NFR-004 | Backward-compatible: existing exponential backoff behavior preserved for normal retry ranges | - -## Testing Requirements - -| Test | Description | -|------|-------------| -| T001 | `calculateBackoff` returns correct values at overflow boundaries (retryCount=30, 63, 64, 100, 1000) | -| T002 | `calculateBackoff` never returns zero or negative for any retryCount 0-10000 | -| T003 | `classifyRefreshError` classifies "server not found" as terminal | -| T004 | `classifyRefreshError` classifies "server does not use OAuth" as terminal | -| T005 | `handleRefreshFailure` stops after max retries and emits failure event | -| T006 | `handleRefreshFailure` treats terminal errors same as invalid_grant (immediate stop) | -| T007 | `rescheduleAfterDelay` enforces minimum delay | -| T008 | Integration: server-not-found error stops refresh immediately (no retry) |