From b151ad4fd30676fea763e0cb66b159a20d1f912b Mon Sep 17 00:00:00 2001
From: Claude Code <noreply@anthropic.com>
Date: Sun, 1 Mar 2026 16:29:55 +0200
Subject: [PATCH 1/2] fix(oauth): prevent refresh manager deadloop on
 server-not-found errors [#310]

Fix three bugs causing infinite 0-delay retry loops when an OAuth
server is removed from config or becomes unavailable:

1. Integer overflow in calculateBackoff: 1<<uint(retryCount) overflows
   at retryCount>=30, producing negative then zero durations. Cap the
   exponent at 25 and guard against non-positive results.

2. "server not found" errors classified as retryable: Add new terminal
   classification "failed_server_gone" for server-not-found and
   server-does-not-use-OAuth errors that stop retries immediately.

3. No maximum retry limit: Change DefaultMaxRetries from 0 (unlimited)
   to 50 as a circuit breaker. Also enforce MinRefreshInterval (5s) as
   minimum delay floor in rescheduleAfterDelay.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 autonomous_summary.md                  |  71 +++++++++++
 internal/oauth/refresh_manager.go      |  90 +++++++++++--
 internal/oauth/refresh_manager_test.go | 167 +++++++++++++++++++++++++
 plan.md                                |  87 +++++++++++++
 spec.md                                | 100 +++++++++++++++
 5 files changed, 507 insertions(+), 8 deletions(-)
 create mode 100644 autonomous_summary.md
 create mode 100644 plan.md
 create mode 100644 spec.md

diff --git a/autonomous_summary.md b/autonomous_summary.md
new file mode 100644
index 00000000..6e92b65a
--- /dev/null
+++ b/autonomous_summary.md
@@ -0,0 +1,71 @@
+# Autonomous Summary: Fix OAuth Refresh Deadloop (Issue #310)
+
+## Problem
+
+When an OAuth server was removed from configuration or became unavailable, the refresh manager entered an infinite retry loop with 0-second delays, producing 23M+ retries and flooding logs. The only recovery was restarting mcpproxy.
+
+## Root Cause
+
+Three interconnected bugs in `internal/oauth/refresh_manager.go`:
+
+1. **Integer overflow in `calculateBackoff`**: The expression `1 << uint(retryCount)` overflows at retryCount >= 30 on 64-bit systems. At retryCount 30, the shift produces a value that, when multiplied by `RetryBackoffBase` (10s in nanoseconds), overflows `int64` to a negative value. The comparison `backoff > MaxRetryBackoff` fails to catch negative values. At retryCount >= 64, the shift wraps to 0, producing `backoff = 0s` permanently.
+
+2. **"Server not found" classified as retryable**: The error `"server not found: gcw2"` from `upstream/manager.go` was classified as `"failed_other"` — neither terminal nor network-retryable. This caused it to enter the retry path, which eventually hit the overflow bug.
+
+3. **No maximum retry limit**: `DefaultMaxRetries = 0` meant unlimited retries, so once the backoff dropped to 0s, retries looped infinitely with no circuit breaker.
+
+## Changes Made
+
+### `internal/oauth/refresh_manager.go`
+
+1. **Fixed `calculateBackoff` overflow** (lines 800-820):
+   - Added `maxBackoffExponent = 25` constant
+   - When `retryCount > maxBackoffExponent`, returns `MaxRetryBackoff` immediately
+   - Added post-computation guard: if `backoff <= 0`, use `MaxRetryBackoff`
+
+2. **Added terminal error classification** (lines 760-770):
+   - New classification `"failed_server_gone"` for "server not found" and "server does not use OAuth" errors
+   - These are checked before network/grant errors since they should never be retried
+
+3. **Handle terminal errors in `handleRefreshFailure`** (lines 718-733):
+   - `"failed_server_gone"` errors immediately set `RefreshStateFailed` and emit failure event
+   - No retry scheduled — the server won't magically reappear
+
+4. **Changed `DefaultMaxRetries` from 0 to 50** (line 25):
+   - Circuit breaker: after 50 consecutive failures, stops retrying
+   - With exponential backoff (10s base, 5min cap), 50 retries spans ~2+ hours
+
+5. **Check max retries in `handleRefreshFailure`** (lines 735-750):
+   - When `retryCount >= maxRetries`, sets `RefreshStateFailed` and emits failure event
+
+6. **Enforced minimum delay in `rescheduleAfterDelay`** (lines 825-828):
+   - If `delay < MinRefreshInterval` (5s), uses `MinRefreshInterval`
+   - Defense-in-depth against any code path producing 0 or negative delays
+
+### `internal/oauth/refresh_manager_test.go`
+
+Added 6 new test functions:
+
+- `TestRefreshManager_BackoffOverflowProtection`: Verifies backoff is positive and capped at boundaries (retryCount 30, 63, 64, 100, 1000, 23158728) and exhaustively for 0-10000
+- `TestRefreshManager_ServerNotFoundIsTerminal`: Verifies "server not found" stops refresh immediately with no retry
+- `TestRefreshManager_MaxRetryLimit`: Verifies max retry limit stops retries and emits failure event
+- `TestDefaultMaxRetries_IsNonZero`: Verifies the constant is non-zero
+- `TestRefreshManager_MinimumDelayEnforced`: Verifies 0 and negative delays are clamped to MinRefreshInterval
+- Extended `TestClassifyRefreshError` with 3 new cases for server-gone errors
+
+## Test Results
+
+```
+go test -race ./internal/oauth/... -v  →  PASS (17.2s)
+go build ./internal/oauth/...          →  OK
+go vet ./internal/oauth/...            →  OK
+```
+
+All 28 tests in the oauth package pass, including all pre-existing tests (backward compatible).
+
+## Verification
+
+The fix addresses all three failure modes from the issue:
+- **Overflow**: `calculateBackoff(23158728)` now returns `5m0s` (was `0s`)
+- **Terminal errors**: "server not found" immediately stops (was infinite retry)
+- **Circuit breaker**: Even unknown errors stop after 50 retries (was unlimited)
diff --git a/internal/oauth/refresh_manager.go b/internal/oauth/refresh_manager.go
index a49a0a8c..a782f646 100644
--- a/internal/oauth/refresh_manager.go
+++ b/internal/oauth/refresh_manager.go
@@ -20,9 +20,10 @@ const (
 	// 0.75 means refresh at 75% of lifetime for long-lived tokens.
 	DefaultRefreshThreshold = 0.75
 
-	// DefaultMaxRetries is the maximum number of refresh attempts before giving up.
-	// Set to 0 for unlimited retries until token expiration (FR-009).
-	DefaultMaxRetries = 0
+	// DefaultMaxRetries is the maximum number of consecutive refresh attempts before giving up.
+	// Acts as a circuit breaker to prevent infinite retry loops (issue #310).
+	// With exponential backoff (10s base, 5min cap), 50 retries spans ~2+ hours.
+	DefaultMaxRetries = 50
 
 	// MinRefreshInterval prevents too-frequent refresh attempts.
 	MinRefreshInterval = 5 * time.Second
@@ -672,7 +673,8 @@ func (m *RefreshManager) handleRefreshSuccess(serverName string) {
 }
 
 // handleRefreshFailure handles a failed token refresh with exponential backoff retry.
-// Per FR-009: Retries continue until token expiration (unlimited retries), not a fixed count.
+// Terminal errors (invalid_grant, server not found) stop immediately.
+// Transient errors retry with exponential backoff up to maxRetries.
 func (m *RefreshManager) handleRefreshFailure(serverName string, err error) {
 	m.mu.Lock()
 	schedule := m.schedules[serverName]
@@ -715,8 +717,48 @@ func (m *RefreshManager) handleRefreshFailure(serverName string, err error) {
 		return
 	}
 
-	// Check if we should continue retrying (unlimited retries until token expiration per FR-009)
-	// Only stop if the access token has completely expired AND no more time remains
+	// Check if the server is gone (removed from config or not OAuth).
+	// No amount of retrying will fix this - stop immediately.
+	if errorType == "failed_server_gone" {
+		m.logger.Error("OAuth refresh stopped - server no longer available",
+			zap.String("server", serverName),
+			zap.Error(err))
+
+		m.mu.Lock()
+		if schedule := m.schedules[serverName]; schedule != nil {
+			schedule.RefreshState = RefreshStateFailed
+			schedule.LastError = err.Error()
+		}
+		m.mu.Unlock()
+
+		if m.eventEmitter != nil {
+			m.eventEmitter.EmitOAuthRefreshFailed(serverName, err.Error())
+		}
+		return
+	}
+
+	// Check if max retries exceeded (circuit breaker to prevent infinite loops).
+	if m.maxRetries > 0 && retryCount >= m.maxRetries {
+		m.logger.Error("OAuth token refresh failed - max retries exceeded",
+			zap.String("server", serverName),
+			zap.Int("max_retries", m.maxRetries),
+			zap.Int("retry_count", retryCount))
+
+		m.mu.Lock()
+		if schedule := m.schedules[serverName]; schedule != nil {
+			schedule.RefreshState = RefreshStateFailed
+			schedule.LastError = "Max retries exceeded - re-authentication required"
+		}
+		m.mu.Unlock()
+
+		if m.eventEmitter != nil {
+			m.eventEmitter.EmitOAuthRefreshFailed(serverName, err.Error())
+		}
+		return
+	}
+
+	// Check if we should continue retrying based on token expiration.
+	// Only stop if the access token has completely expired AND no more time remains.
 	now := time.Now()
 	if !expiresAt.IsZero() && now.After(expiresAt) {
 		// Token has already expired - check if we should give up
@@ -748,7 +790,7 @@ func (m *RefreshManager) handleRefreshFailure(serverName string, err error) {
 }
 
 // classifyRefreshError categorizes a refresh error for metrics and error handling.
-// Returns one of: "failed_network", "failed_invalid_grant", "failed_other".
+// Returns one of: "failed_network", "failed_invalid_grant", "failed_server_gone", "failed_other".
 func classifyRefreshError(err error) string {
 	if err == nil {
 		return "success"
@@ -756,6 +798,18 @@ func classifyRefreshError(err error) string {
 
 	errStr := err.Error()
 
+	// Check for terminal server-gone errors (server removed from config or not OAuth).
+	// These should never be retried because the server no longer exists or doesn't use OAuth.
+	serverGoneErrors := []string{
+		"server not found",
+		"server does not use OAuth",
+	}
+	for _, pattern := range serverGoneErrors {
+		if stringutil.ContainsIgnoreCase(errStr, pattern) {
+			return "failed_server_gone"
+		}
+	}
+
 	// Check for permanent OAuth errors (refresh token invalid/expired)
 	permanentErrors := []string{
 		"invalid_grant",
@@ -789,15 +843,29 @@ func classifyRefreshError(err error) string {
 	return "failed_other"
 }
 
+// maxBackoffExponent is the maximum shift exponent that won't overflow when
+// multiplied by RetryBackoffBase (10s = 10_000_000_000 ns). On 64-bit systems,
+// 1<<30 * 10e9 overflows int64. We cap at 25 which gives 10s * 2^25 = 335,544,320s
+// — well above MaxRetryBackoff (300s), so the cap applies naturally.
+const maxBackoffExponent = 25
+
 // calculateBackoff calculates the exponential backoff duration for a given retry count.
 // The formula is: base * 2^retryCount, capped at MaxRetryBackoff (5 minutes).
 // Sequence: 10s → 20s → 40s → 80s → 160s → 300s (cap).
+//
+// The exponent is capped at maxBackoffExponent to prevent integer overflow
+// that caused issue #310 (0s delay at high retry counts leading to infinite loops).
 func (m *RefreshManager) calculateBackoff(retryCount int) time.Duration {
 	if retryCount < 0 {
 		retryCount = 0
 	}
+	// Cap the exponent to prevent integer overflow.
+	// Beyond this exponent, the result would exceed MaxRetryBackoff anyway.
+	if retryCount > maxBackoffExponent {
+		return MaxRetryBackoff
+	}
 	backoff := RetryBackoffBase * time.Duration(1<<uint(retryCount))
-	if backoff > MaxRetryBackoff {
+	if backoff <= 0 || backoff > MaxRetryBackoff {
 		backoff = MaxRetryBackoff
 	}
 	return backoff
@@ -814,7 +882,13 @@ func (m *RefreshManager) isRateLimited(schedule *RefreshSchedule) bool {
 }
 
 // rescheduleAfterDelay reschedules a refresh attempt after a delay.
+// The delay is enforced to be at least MinRefreshInterval to prevent tight loops.
 func (m *RefreshManager) rescheduleAfterDelay(serverName string, delay time.Duration) {
+	// Enforce minimum delay to prevent tight retry loops (defense-in-depth for issue #310).
+	if delay < MinRefreshInterval {
+		delay = MinRefreshInterval
+	}
+
 	m.mu.Lock()
 	defer m.mu.Unlock()
 
diff --git a/internal/oauth/refresh_manager_test.go b/internal/oauth/refresh_manager_test.go
index c79fbd8d..fef869ee 100644
--- a/internal/oauth/refresh_manager_test.go
+++ b/internal/oauth/refresh_manager_test.go
@@ -647,6 +647,10 @@ func TestClassifyRefreshError(t *testing.T) {
 		{"context deadline exceeded", errors.New("context deadline exceeded"), "failed_network"},
 		{"generic error", errors.New("unknown server error"), "failed_other"},
 		{"server_error", errors.New("OAuth server_error"), "failed_other"},
+		// Terminal errors: server gone
+		{"server not found", errors.New("server not found: gcw2"), "failed_server_gone"},
+		{"server not found wrapped", errors.New("failed to refresh OAuth token: server not found: myserver"), "failed_server_gone"},
+		{"server does not use OAuth", errors.New("server does not use OAuth: gcw2"), "failed_server_gone"},
 	}
 
 	for _, tt := range tests {
@@ -656,3 +660,166 @@ func TestClassifyRefreshError(t *testing.T) {
 		})
 	}
 }
+
+// Test that calculateBackoff never returns zero or negative for any retry count (overflow protection).
+func TestRefreshManager_BackoffOverflowProtection(t *testing.T) {
+	logger := zaptest.NewLogger(t)
+	manager := NewRefreshManager(nil, nil, nil, logger)
+
+	// Specific boundary cases that triggered the original bug
+	boundaryTests := []struct {
+		retryCount int
+		desc       string
+	}{
+		{30, "overflow boundary on 64-bit (10s * 2^30 overflows nanoseconds)"},
+		{63, "1<<63 is min int64, duration becomes 0 or negative"},
+		{64, "1<<64 wraps to 0 on 64-bit"},
+		{100, "well beyond overflow"},
+		{1000, "extreme retry count"},
+		{23158728, "actual retry count from issue #310"},
+	}
+
+	for _, tt := range boundaryTests {
+		t.Run(tt.desc, func(t *testing.T) {
+			backoff := manager.calculateBackoff(tt.retryCount)
+			assert.Greater(t, backoff, time.Duration(0),
+				"Backoff must be positive at retryCount=%d", tt.retryCount)
+			assert.LessOrEqual(t, backoff, MaxRetryBackoff,
+				"Backoff must not exceed MaxRetryBackoff at retryCount=%d", tt.retryCount)
+		})
+	}
+
+	// Exhaustive check: no retry count from 0 to 10000 produces zero or negative backoff
+	for i := 0; i <= 10000; i++ {
+		backoff := manager.calculateBackoff(i)
+		if backoff <= 0 || backoff > MaxRetryBackoff {
+			t.Fatalf("calculateBackoff(%d) = %v, want positive and <= %v", i, backoff, MaxRetryBackoff)
+		}
+	}
+}
+
+// Test that "server not found" is treated as terminal (no retry).
+func TestRefreshManager_ServerNotFoundIsTerminal(t *testing.T) {
+	logger := zaptest.NewLogger(t)
+	store := newMockTokenStore()
+	emitter := &mockEventEmitter{}
+
+	manager := NewRefreshManager(store, nil, nil, logger)
+	runtime := &mockRuntime{
+		refreshErr: errors.New("failed to refresh OAuth token: server not found: gcw2"),
+	}
+	manager.SetRuntime(runtime)
+	manager.SetEventEmitter(emitter)
+
+	ctx := context.Background()
+	err := manager.Start(ctx)
+	require.NoError(t, err)
+	defer manager.Stop()
+
+	// Create a schedule and trigger refresh
+	expiresAt := time.Now().Add(1 * time.Hour)
+	manager.OnTokenSaved("gcw2", expiresAt)
+	time.Sleep(50 * time.Millisecond)
+
+	// Execute refresh which will fail with "server not found"
+	manager.executeRefresh("gcw2")
+	time.Sleep(50 * time.Millisecond)
+
+	// Should be in failed state (not retrying)
+	schedule := manager.GetSchedule("gcw2")
+	require.NotNil(t, schedule)
+	assert.Equal(t, RefreshStateFailed, schedule.RefreshState,
+		"Server not found should immediately fail, not retry")
+	assert.Equal(t, 1, schedule.RetryCount, "Should have exactly 1 attempt")
+
+	// Failure event should be emitted
+	assert.Equal(t, 1, emitter.GetFailedEvents(),
+		"Should emit failure event for terminal server-not-found error")
+}
+
+// Test that max retry limit stops retries.
+func TestRefreshManager_MaxRetryLimit(t *testing.T) {
+	logger := zaptest.NewLogger(t)
+	store := newMockTokenStore()
+	emitter := &mockEventEmitter{}
+
+	config := &RefreshManagerConfig{
+		Threshold:  0.1,
+		MaxRetries: 5, // Low limit for testing
+	}
+
+	manager := NewRefreshManager(store, nil, config, logger)
+	runtime := &mockRuntime{
+		refreshErr: errors.New("some transient error"),
+	}
+	manager.SetRuntime(runtime)
+	manager.SetEventEmitter(emitter)
+
+	ctx := context.Background()
+	err := manager.Start(ctx)
+	require.NoError(t, err)
+	defer manager.Stop()
+
+	// Create a schedule
+	expiresAt := time.Now().Add(1 * time.Hour)
+	manager.OnTokenSaved("test-server", expiresAt)
+	time.Sleep(50 * time.Millisecond)
+
+	// Simulate failures beyond max retries
+	for i := 0; i < 7; i++ {
+		manager.executeRefresh("test-server")
+		time.Sleep(10 * time.Millisecond)
+	}
+
+	// After exceeding max retries, should be in failed state
+	schedule := manager.GetSchedule("test-server")
+	require.NotNil(t, schedule)
+	assert.Equal(t, RefreshStateFailed, schedule.RefreshState,
+		"Should transition to failed after max retries exceeded")
+
+	// Failure event should be emitted
+	assert.GreaterOrEqual(t, emitter.GetFailedEvents(), 1,
+		"Should emit failure event when max retries exceeded")
+}
+
+// Test that DefaultMaxRetries is now non-zero.
+func TestDefaultMaxRetries_IsNonZero(t *testing.T) {
+	assert.Greater(t, DefaultMaxRetries, 0,
+		"DefaultMaxRetries must be non-zero to prevent infinite retry loops")
+}
+
+// Test that rescheduleAfterDelay enforces minimum delay.
+func TestRefreshManager_MinimumDelayEnforced(t *testing.T) {
+	logger := zaptest.NewLogger(t)
+	store := newMockTokenStore()
+
+	manager := NewRefreshManager(store, nil, nil, logger)
+	runtime := &mockRuntime{}
+	manager.SetRuntime(runtime)
+
+	ctx := context.Background()
+	err := manager.Start(ctx)
+	require.NoError(t, err)
+	defer manager.Stop()
+
+	// Create a schedule first
+	expiresAt := time.Now().Add(1 * time.Hour)
+	manager.OnTokenSaved("test-server", expiresAt)
+	time.Sleep(50 * time.Millisecond)
+
+	// Call rescheduleAfterDelay with zero delay
+	manager.rescheduleAfterDelay("test-server", 0)
+
+	schedule := manager.GetSchedule("test-server")
+	require.NotNil(t, schedule)
+	assert.GreaterOrEqual(t, schedule.RetryBackoff, MinRefreshInterval,
+		"Delay should be at least MinRefreshInterval even when 0 is passed")
+
+	// Call rescheduleAfterDelay with negative delay
+	manager.rescheduleAfterDelay("test-server", -5*time.Second)
+
+	schedule = manager.GetSchedule("test-server")
+	require.NotNil(t, schedule)
+	assert.GreaterOrEqual(t, schedule.RetryBackoff, MinRefreshInterval,
+		"Delay should be at least MinRefreshInterval even when negative is passed")
+}
diff --git a/plan.md b/plan.md
new file mode 100644
index 00000000..9e66fd53
--- /dev/null
+++ b/plan.md
@@ -0,0 +1,87 @@
+# Implementation Plan: Fix OAuth Refresh Deadloop (Issue #310)
+
+## Overview
+
+Four changes to `internal/oauth/refresh_manager.go` and updates to `internal/oauth/refresh_manager_test.go`, implemented in TDD order.
+
+## Step 1: Fix `calculateBackoff` Integer Overflow
+
+### Test First
+Add `TestRefreshManager_BackoffOverflowProtection` to verify:
+- `calculateBackoff(30)` returns `MaxRetryBackoff` (not negative/zero)
+- `calculateBackoff(63)` returns `MaxRetryBackoff` (not zero)
+- `calculateBackoff(64)` returns `MaxRetryBackoff` (not zero)
+- `calculateBackoff(1000)` returns `MaxRetryBackoff` (not zero)
+- Loop 0..10000 verifying all results > 0 and <= MaxRetryBackoff
+
+### Implementation
+In `calculateBackoff`:
+- Cap `retryCount` to a safe maximum exponent (e.g., 30) to prevent overflow
+- Add post-computation guard: if `backoff <= 0`, set to `MaxRetryBackoff`
+
+## Step 2: Add Terminal Error Classification for "Server Not Found"
+
+### Test First
+Add test cases to `TestClassifyRefreshError`:
+- `"server not found: gcw2"` -> `"failed_server_gone"`
+- `"server does not use OAuth: gcw2"` -> `"failed_server_gone"`
+
+Add `TestRefreshManager_ServerNotFoundIsTerminal` to verify:
+- `handleRefreshFailure` with "server not found" error stops immediately
+- Schedule transitions to `RefreshStateFailed`
+- Failure event is emitted
+
+### Implementation
+In `classifyRefreshError`:
+- Add new terminal error patterns: `"server not found"`, `"server does not use OAuth"`
+- Return `"failed_server_gone"` for these patterns
+
+In `handleRefreshFailure`:
+- Handle `"failed_server_gone"` the same as `"failed_invalid_grant"` (immediate stop, no retry)
+
+## Step 3: Add Maximum Retry Limit
+
+### Test First
+Add `TestRefreshManager_MaxRetryLimit` to verify:
+- After `DefaultMaxRetries` (50) consecutive failures, schedule transitions to `RefreshStateFailed`
+- Failure event is emitted when limit is reached
+
+### Implementation
+- Change `DefaultMaxRetries` from `0` to `50`
+- In `handleRefreshFailure`, after classifying the error and before calculating backoff:
+  - Check if `retryCount >= m.maxRetries` (when `m.maxRetries > 0`)
+  - If exceeded, log error, set state to `RefreshStateFailed`, emit event, return
+
+Note: Update `NewRefreshManager` to use the new default -- currently `if config.MaxRetries > 0` would override, so 0 still means "use default of 50".
+
+## Step 4: Enforce Minimum Delay Floor in `rescheduleAfterDelay`
+
+### Test First
+Add `TestRefreshManager_MinimumDelayEnforced` to verify:
+- `rescheduleAfterDelay` with 0 delay uses `MinRefreshInterval` instead
+- `rescheduleAfterDelay` with negative delay uses `MinRefreshInterval` instead
+
+### Implementation
+In `rescheduleAfterDelay`:
+- Add check: `if delay < MinRefreshInterval { delay = MinRefreshInterval }`
+
+## Step 5: Verify and Commit
+
+1. Run `go test -race ./internal/oauth/... -v`
+2. Run `go build ./...`
+3. Write `autonomous_summary.md`
+4. Commit all changes on branch `fix/oauth-refresh-deadloop-310`
+
+## Files Modified
+
+| File | Changes |
+|------|---------|
+| `internal/oauth/refresh_manager.go` | Fix `calculateBackoff`, add terminal error handling in `handleRefreshFailure`, enforce min delay in `rescheduleAfterDelay`, change `DefaultMaxRetries` |
+| `internal/oauth/refresh_manager_test.go` | Add 5+ new test functions covering overflow, terminal errors, max retries, min delay |
+
+## Risk Assessment
+
+- **Low risk**: All changes are defensive additions (caps, guards, new classifications)
+- **No behavioral change for normal operations**: Backoff sequence 10s->20s->40s->80s->160s->300s unchanged for retryCount 0-5
+- **No API changes**: No public interface modifications
+- **Backward compatible**: Existing configs continue to work
diff --git a/spec.md b/spec.md
new file mode 100644
index 00000000..14feba01
--- /dev/null
+++ b/spec.md
@@ -0,0 +1,100 @@
+# Spec: Fix OAuth Refresh Deadloop (Issue #310)
+
+## Problem Statement
+
+When an OAuth server is removed from configuration or becomes unavailable, the refresh manager enters an infinite retry loop with 0-second delays, producing millions of retries per minute and spamming logs. The only recovery is restarting mcpproxy.
+
+### Root Cause Analysis
+
+There are **three distinct bugs** contributing to this issue:
+
+#### Bug 1: Integer Overflow in Backoff Calculation
+
+The `calculateBackoff` function uses `1 << uint(retryCount)` which overflows on 64-bit systems:
+
+- At `retryCount >= 30`: `1 << 30 = 1073741824`, multiplied by `10s` base overflows `time.Duration` (int64 nanoseconds) to a **negative value**. The `backoff > MaxRetryBackoff` check does NOT catch negative values, so the uncapped negative duration is used.
+- At `retryCount >= 63`: `1 << 63` overflows to the minimum int64 value, and `1 << 64 = 0`, giving `backoff = 0s`.
+- At `retryCount >= 64`: All further shifts produce `0`, giving `backoff = 0s` permanently.
+
+This means after ~30 retries (about 5 minutes of exponential backoff), the delay drops to 0 and retries become infinite tight loops.
+
+#### Bug 2: "Server Not Found" Classified as Retryable
+
+The error `"server not found: gcw2"` from `upstream/manager.go` is classified as `"failed_other"` by `classifyRefreshError`. This classification does NOT trigger the permanent failure path (only `"failed_invalid_grant"` does). Instead, it enters the retry path with backoff, which then hits Bug 1.
+
+When a server is removed from configuration, "server not found" is a **terminal** condition -- no amount of retrying will make the server appear. This should be treated as a permanent failure.
+
+#### Bug 3: No Maximum Retry Limit
+
+The constant `DefaultMaxRetries = 0` means "unlimited retries." Combined with Bug 1 causing 0s delays, this allows the retry count to reach 23M+ without any circuit breaker stopping it.
+
+## Solution
+
+### Fix 1: Overflow-Safe Backoff Calculation
+
+Cap the shift exponent to prevent integer overflow. The maximum useful exponent before hitting `MaxRetryBackoff` (5 minutes = 300s) with `RetryBackoffBase` (10s) is 5 (`10s * 2^5 = 320s > 300s`). Cap the exponent at a safe value (e.g., 30) that prevents overflow while still allowing the backoff cap to apply naturally.
+
+Additionally, add a guard: if the computed backoff is zero or negative, use `MaxRetryBackoff` as a fallback.
+
+### Fix 2: Classify "Server Not Found" as Terminal Error
+
+Add `"server not found"` to the list of terminal (permanent) errors in `classifyRefreshError`. When a server is not in the configuration, retrying is pointless. The same applies to `"server does not use OAuth"`.
+
+Create a new error classification category: `"failed_server_gone"` for errors that indicate the server no longer exists or is not applicable for OAuth refresh.
+
+### Fix 3: Add Maximum Consecutive Retry Limit
+
+Change `DefaultMaxRetries` from `0` (unlimited) to `50`. After 50 consecutive failures, mark the refresh as permanently failed and stop retrying. This acts as a circuit breaker even if Fixes 1 and 2 don't cover an edge case.
+
+The retry limit is per-schedule and resets on success (via `handleRefreshSuccess` which sets `RetryCount = 0`).
+
+### Fix 4: Enforce Minimum Delay Floor
+
+In `rescheduleAfterDelay`, enforce that the delay is never less than `MinRefreshInterval` (5 seconds). This provides defense-in-depth against any code path that computes a 0 or negative delay.
+
+## Assumptions
+
+1. **50 max retries is sufficient**: With exponential backoff (10s, 20s, 40s, ... 300s cap), 50 retries spans approximately 2+ hours. If a transient issue isn't resolved in 2 hours, it's likely permanent. Users can re-trigger refresh via re-login.
+
+2. **"Server not found" is always terminal**: If a server was removed from config while having stored OAuth tokens, the refresh should stop immediately. The stale token record will be cleaned up separately (or via `OnTokenCleared`).
+
+3. **"Server does not use OAuth" is terminal**: Similar to "server not found" -- if the server config changed to remove OAuth, retrying is pointless.
+
+4. **The backoff exponent cap of 30 is safe**: `1 << 30` on a 64-bit platform is `1073741824`, which when multiplied by 10s (`10_000_000_000 ns`) gives `10_737_418_240_000_000_000 ns` -- still within int64 range but will be capped by `MaxRetryBackoff` anyway. We clamp before this point for defense-in-depth.
+
+5. **No configuration change needed**: The max retry limit is a code-level constant, not a user-facing config. This keeps the fix simple and secure by default.
+
+## Functional Requirements
+
+| ID | Requirement |
+|----|-------------|
+| FR-001 | `calculateBackoff` MUST NOT produce zero or negative durations for any retry count value |
+| FR-002 | `classifyRefreshError` MUST classify "server not found" errors as terminal (non-retryable) |
+| FR-003 | `classifyRefreshError` MUST classify "server does not use OAuth" errors as terminal |
+| FR-004 | `handleRefreshFailure` MUST stop retrying after `DefaultMaxRetries` consecutive failures |
+| FR-005 | `rescheduleAfterDelay` MUST enforce a minimum delay of `MinRefreshInterval` (5 seconds) |
+| FR-006 | When max retries are exceeded, the schedule MUST transition to `RefreshStateFailed` |
+| FR-007 | When a terminal error is detected, the schedule MUST transition to `RefreshStateFailed` immediately |
+| FR-008 | A failure event MUST be emitted when retries are exhausted or a terminal error occurs |
+
+## Non-Functional Requirements
+
+| ID | Requirement |
+|----|-------------|
+| NFR-001 | All changes must pass `go test -race ./internal/oauth/...` |
+| NFR-002 | All changes must compile with `go build ./...` |
+| NFR-003 | No breaking changes to public APIs or configuration |
+| NFR-004 | Backward-compatible: existing exponential backoff behavior preserved for normal retry ranges |
+
+## Testing Requirements
+
+| Test | Description |
+|------|-------------|
+| T001 | `calculateBackoff` returns correct values at overflow boundaries (retryCount=30, 63, 64, 100, 1000) |
+| T002 | `calculateBackoff` never returns zero or negative for any retryCount 0-10000 |
+| T003 | `classifyRefreshError` classifies "server not found" as terminal |
+| T004 | `classifyRefreshError` classifies "server does not use OAuth" as terminal |
+| T005 | `handleRefreshFailure` stops after max retries and emits failure event |
+| T006 | `handleRefreshFailure` treats terminal errors same as invalid_grant (immediate stop) |
+| T007 | `rescheduleAfterDelay` enforces minimum delay |
+| T008 | Integration: server-not-found error stops refresh immediately (no retry) |

From 8aebb8760cc6de9b4b3eba58c294a35de1b95834 Mon Sep 17 00:00:00 2001
From: Claude Code <noreply@anthropic.com>
Date: Sun, 1 Mar 2026 17:00:52 +0200
Subject: [PATCH 2/2] chore: remove spec/plan/summary artifacts from PR

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 autonomous_summary.md |  71 ------------------------------
 plan.md               |  87 ------------------------------------
 spec.md               | 100 ------------------------------------------
 3 files changed, 258 deletions(-)
 delete mode 100644 autonomous_summary.md
 delete mode 100644 plan.md
 delete mode 100644 spec.md

diff --git a/autonomous_summary.md b/autonomous_summary.md
deleted file mode 100644
index 6e92b65a..00000000
--- a/autonomous_summary.md
+++ /dev/null
@@ -1,71 +0,0 @@
-# Autonomous Summary: Fix OAuth Refresh Deadloop (Issue #310)
-
-## Problem
-
-When an OAuth server was removed from configuration or became unavailable, the refresh manager entered an infinite retry loop with 0-second delays, producing 23M+ retries and flooding logs. The only recovery was restarting mcpproxy.
-
-## Root Cause
-
-Three interconnected bugs in `internal/oauth/refresh_manager.go`:
-
-1. **Integer overflow in `calculateBackoff`**: The expression `1 << uint(retryCount)` overflows at retryCount >= 30 on 64-bit systems. At retryCount 30, the shift produces a value that, when multiplied by `RetryBackoffBase` (10s in nanoseconds), overflows `int64` to a negative value. The comparison `backoff > MaxRetryBackoff` fails to catch negative values. At retryCount >= 64, the shift wraps to 0, producing `backoff = 0s` permanently.
-
-2. **"Server not found" classified as retryable**: The error `"server not found: gcw2"` from `upstream/manager.go` was classified as `"failed_other"` — neither terminal nor network-retryable. This caused it to enter the retry path, which eventually hit the overflow bug.
-
-3. **No maximum retry limit**: `DefaultMaxRetries = 0` meant unlimited retries, so once the backoff dropped to 0s, retries looped infinitely with no circuit breaker.
-
-## Changes Made
-
-### `internal/oauth/refresh_manager.go`
-
-1. **Fixed `calculateBackoff` overflow** (lines 800-820):
-   - Added `maxBackoffExponent = 25` constant
-   - When `retryCount > maxBackoffExponent`, returns `MaxRetryBackoff` immediately
-   - Added post-computation guard: if `backoff <= 0`, use `MaxRetryBackoff`
-
-2. **Added terminal error classification** (lines 760-770):
-   - New classification `"failed_server_gone"` for "server not found" and "server does not use OAuth" errors
-   - These are checked before network/grant errors since they should never be retried
-
-3. **Handle terminal errors in `handleRefreshFailure`** (lines 718-733):
-   - `"failed_server_gone"` errors immediately set `RefreshStateFailed` and emit failure event
-   - No retry scheduled — the server won't magically reappear
-
-4. **Changed `DefaultMaxRetries` from 0 to 50** (line 25):
-   - Circuit breaker: after 50 consecutive failures, stops retrying
-   - With exponential backoff (10s base, 5min cap), 50 retries spans ~2+ hours
-
-5. **Check max retries in `handleRefreshFailure`** (lines 735-750):
-   - When `retryCount >= maxRetries`, sets `RefreshStateFailed` and emits failure event
-
-6. **Enforced minimum delay in `rescheduleAfterDelay`** (lines 825-828):
-   - If `delay < MinRefreshInterval` (5s), uses `MinRefreshInterval`
-   - Defense-in-depth against any code path producing 0 or negative delays
-
-### `internal/oauth/refresh_manager_test.go`
-
-Added 6 new test functions:
-
-- `TestRefreshManager_BackoffOverflowProtection`: Verifies backoff is positive and capped at boundaries (retryCount 30, 63, 64, 100, 1000, 23158728) and exhaustively for 0-10000
-- `TestRefreshManager_ServerNotFoundIsTerminal`: Verifies "server not found" stops refresh immediately with no retry
-- `TestRefreshManager_MaxRetryLimit`: Verifies max retry limit stops retries and emits failure event
-- `TestDefaultMaxRetries_IsNonZero`: Verifies the constant is non-zero
-- `TestRefreshManager_MinimumDelayEnforced`: Verifies 0 and negative delays are clamped to MinRefreshInterval
-- Extended `TestClassifyRefreshError` with 3 new cases for server-gone errors
-
-## Test Results
-
-```
-go test -race ./internal/oauth/... -v  →  PASS (17.2s)
-go build ./internal/oauth/...          →  OK
-go vet ./internal/oauth/...            →  OK
-```
-
-All 28 tests in the oauth package pass, including all pre-existing tests (backward compatible).
-
-## Verification
-
-The fix addresses all three failure modes from the issue:
-- **Overflow**: `calculateBackoff(23158728)` now returns `5m0s` (was `0s`)
-- **Terminal errors**: "server not found" immediately stops (was infinite retry)
-- **Circuit breaker**: Even unknown errors stop after 50 retries (was unlimited)
diff --git a/plan.md b/plan.md
deleted file mode 100644
index 9e66fd53..00000000
--- a/plan.md
+++ /dev/null
@@ -1,87 +0,0 @@
-# Implementation Plan: Fix OAuth Refresh Deadloop (Issue #310)
-
-## Overview
-
-Four changes to `internal/oauth/refresh_manager.go` and updates to `internal/oauth/refresh_manager_test.go`, implemented in TDD order.
-
-## Step 1: Fix `calculateBackoff` Integer Overflow
-
-### Test First
-Add `TestRefreshManager_BackoffOverflowProtection` to verify:
-- `calculateBackoff(30)` returns `MaxRetryBackoff` (not negative/zero)
-- `calculateBackoff(63)` returns `MaxRetryBackoff` (not zero)
-- `calculateBackoff(64)` returns `MaxRetryBackoff` (not zero)
-- `calculateBackoff(1000)` returns `MaxRetryBackoff` (not zero)
-- Loop 0..10000 verifying all results > 0 and <= MaxRetryBackoff
-
-### Implementation
-In `calculateBackoff`:
-- Cap `retryCount` to a safe maximum exponent (e.g., 30) to prevent overflow
-- Add post-computation guard: if `backoff <= 0`, set to `MaxRetryBackoff`
-
-## Step 2: Add Terminal Error Classification for "Server Not Found"
-
-### Test First
-Add test cases to `TestClassifyRefreshError`:
-- `"server not found: gcw2"` -> `"failed_server_gone"`
-- `"server does not use OAuth: gcw2"` -> `"failed_server_gone"`
-
-Add `TestRefreshManager_ServerNotFoundIsTerminal` to verify:
-- `handleRefreshFailure` with "server not found" error stops immediately
-- Schedule transitions to `RefreshStateFailed`
-- Failure event is emitted
-
-### Implementation
-In `classifyRefreshError`:
-- Add new terminal error patterns: `"server not found"`, `"server does not use OAuth"`
-- Return `"failed_server_gone"` for these patterns
-
-In `handleRefreshFailure`:
-- Handle `"failed_server_gone"` the same as `"failed_invalid_grant"` (immediate stop, no retry)
-
-## Step 3: Add Maximum Retry Limit
-
-### Test First
-Add `TestRefreshManager_MaxRetryLimit` to verify:
-- After `DefaultMaxRetries` (50) consecutive failures, schedule transitions to `RefreshStateFailed`
-- Failure event is emitted when limit is reached
-
-### Implementation
-- Change `DefaultMaxRetries` from `0` to `50`
-- In `handleRefreshFailure`, after classifying the error and before calculating backoff:
-  - Check if `retryCount >= m.maxRetries` (when `m.maxRetries > 0`)
-  - If exceeded, log error, set state to `RefreshStateFailed`, emit event, return
-
-Note: Update `NewRefreshManager` to use the new default -- currently `if config.MaxRetries > 0` would override, so 0 still means "use default of 50".
-
-## Step 4: Enforce Minimum Delay Floor in `rescheduleAfterDelay`
-
-### Test First
-Add `TestRefreshManager_MinimumDelayEnforced` to verify:
-- `rescheduleAfterDelay` with 0 delay uses `MinRefreshInterval` instead
-- `rescheduleAfterDelay` with negative delay uses `MinRefreshInterval` instead
-
-### Implementation
-In `rescheduleAfterDelay`:
-- Add check: `if delay < MinRefreshInterval { delay = MinRefreshInterval }`
-
-## Step 5: Verify and Commit
-
-1. Run `go test -race ./internal/oauth/... -v`
-2. Run `go build ./...`
-3. Write `autonomous_summary.md`
-4. Commit all changes on branch `fix/oauth-refresh-deadloop-310`
-
-## Files Modified
-
-| File | Changes |
-|------|---------|
-| `internal/oauth/refresh_manager.go` | Fix `calculateBackoff`, add terminal error handling in `handleRefreshFailure`, enforce min delay in `rescheduleAfterDelay`, change `DefaultMaxRetries` |
-| `internal/oauth/refresh_manager_test.go` | Add 5+ new test functions covering overflow, terminal errors, max retries, min delay |
-
-## Risk Assessment
-
-- **Low risk**: All changes are defensive additions (caps, guards, new classifications)
-- **No behavioral change for normal operations**: Backoff sequence 10s->20s->40s->80s->160s->300s unchanged for retryCount 0-5
-- **No API changes**: No public interface modifications
-- **Backward compatible**: Existing configs continue to work
diff --git a/spec.md b/spec.md
deleted file mode 100644
index 14feba01..00000000
--- a/spec.md
+++ /dev/null
@@ -1,100 +0,0 @@
-# Spec: Fix OAuth Refresh Deadloop (Issue #310)
-
-## Problem Statement
-
-When an OAuth server is removed from configuration or becomes unavailable, the refresh manager enters an infinite retry loop with 0-second delays, producing millions of retries per minute and spamming logs. The only recovery is restarting mcpproxy.
-
-### Root Cause Analysis
-
-There are **three distinct bugs** contributing to this issue:
-
-#### Bug 1: Integer Overflow in Backoff Calculation
-
-The `calculateBackoff` function uses `1 << uint(retryCount)` which overflows on 64-bit systems:
-
-- At `retryCount >= 30`: `1 << 30 = 1073741824`, multiplied by `10s` base overflows `time.Duration` (int64 nanoseconds) to a **negative value**. The `backoff > MaxRetryBackoff` check does NOT catch negative values, so the uncapped negative duration is used.
-- At `retryCount >= 63`: `1 << 63` overflows to the minimum int64 value, and `1 << 64 = 0`, giving `backoff = 0s`.
-- At `retryCount >= 64`: All further shifts produce `0`, giving `backoff = 0s` permanently.
-
-This means after ~30 retries (about 5 minutes of exponential backoff), the delay drops to 0 and retries become infinite tight loops.
-
-#### Bug 2: "Server Not Found" Classified as Retryable
-
-The error `"server not found: gcw2"` from `upstream/manager.go` is classified as `"failed_other"` by `classifyRefreshError`. This classification does NOT trigger the permanent failure path (only `"failed_invalid_grant"` does). Instead, it enters the retry path with backoff, which then hits Bug 1.
-
-When a server is removed from configuration, "server not found" is a **terminal** condition -- no amount of retrying will make the server appear. This should be treated as a permanent failure.
-
-#### Bug 3: No Maximum Retry Limit
-
-The constant `DefaultMaxRetries = 0` means "unlimited retries." Combined with Bug 1 causing 0s delays, this allows the retry count to reach 23M+ without any circuit breaker stopping it.
-
-## Solution
-
-### Fix 1: Overflow-Safe Backoff Calculation
-
-Cap the shift exponent to prevent integer overflow. The maximum useful exponent before hitting `MaxRetryBackoff` (5 minutes = 300s) with `RetryBackoffBase` (10s) is 5 (`10s * 2^5 = 320s > 300s`). Cap the exponent at a safe value (e.g., 30) that prevents overflow while still allowing the backoff cap to apply naturally.
-
-Additionally, add a guard: if the computed backoff is zero or negative, use `MaxRetryBackoff` as a fallback.
-
-### Fix 2: Classify "Server Not Found" as Terminal Error
-
-Add `"server not found"` to the list of terminal (permanent) errors in `classifyRefreshError`. When a server is not in the configuration, retrying is pointless. The same applies to `"server does not use OAuth"`.
-
-Create a new error classification category: `"failed_server_gone"` for errors that indicate the server no longer exists or is not applicable for OAuth refresh.
-
-### Fix 3: Add Maximum Consecutive Retry Limit
-
-Change `DefaultMaxRetries` from `0` (unlimited) to `50`. After 50 consecutive failures, mark the refresh as permanently failed and stop retrying. This acts as a circuit breaker even if Fixes 1 and 2 don't cover an edge case.
-
-The retry limit is per-schedule and resets on success (via `handleRefreshSuccess` which sets `RetryCount = 0`).
-
-### Fix 4: Enforce Minimum Delay Floor
-
-In `rescheduleAfterDelay`, enforce that the delay is never less than `MinRefreshInterval` (5 seconds). This provides defense-in-depth against any code path that computes a 0 or negative delay.
-
-## Assumptions
-
-1. **50 max retries is sufficient**: With exponential backoff (10s, 20s, 40s, ... 300s cap), 50 retries spans approximately 2+ hours. If a transient issue isn't resolved in 2 hours, it's likely permanent. Users can re-trigger refresh via re-login.
-
-2. **"Server not found" is always terminal**: If a server was removed from config while having stored OAuth tokens, the refresh should stop immediately. The stale token record will be cleaned up separately (or via `OnTokenCleared`).
-
-3. **"Server does not use OAuth" is terminal**: Similar to "server not found" -- if the server config changed to remove OAuth, retrying is pointless.
-
-4. **The backoff exponent cap of 30 is safe**: `1 << 30` on a 64-bit platform is `1073741824`, which when multiplied by 10s (`10_000_000_000 ns`) gives `10_737_418_240_000_000_000 ns` -- still within int64 range but will be capped by `MaxRetryBackoff` anyway. We clamp before this point for defense-in-depth.
-
-5. **No configuration change needed**: The max retry limit is a code-level constant, not a user-facing config. This keeps the fix simple and secure by default.
-
-## Functional Requirements
-
-| ID | Requirement |
-|----|-------------|
-| FR-001 | `calculateBackoff` MUST NOT produce zero or negative durations for any retry count value |
-| FR-002 | `classifyRefreshError` MUST classify "server not found" errors as terminal (non-retryable) |
-| FR-003 | `classifyRefreshError` MUST classify "server does not use OAuth" errors as terminal |
-| FR-004 | `handleRefreshFailure` MUST stop retrying after `DefaultMaxRetries` consecutive failures |
-| FR-005 | `rescheduleAfterDelay` MUST enforce a minimum delay of `MinRefreshInterval` (5 seconds) |
-| FR-006 | When max retries are exceeded, the schedule MUST transition to `RefreshStateFailed` |
-| FR-007 | When a terminal error is detected, the schedule MUST transition to `RefreshStateFailed` immediately |
-| FR-008 | A failure event MUST be emitted when retries are exhausted or a terminal error occurs |
-
-## Non-Functional Requirements
-
-| ID | Requirement |
-|----|-------------|
-| NFR-001 | All changes must pass `go test -race ./internal/oauth/...` |
-| NFR-002 | All changes must compile with `go build ./...` |
-| NFR-003 | No breaking changes to public APIs or configuration |
-| NFR-004 | Backward-compatible: existing exponential backoff behavior preserved for normal retry ranges |
-
-## Testing Requirements
-
-| Test | Description |
-|------|-------------|
-| T001 | `calculateBackoff` returns correct values at overflow boundaries (retryCount=30, 63, 64, 100, 1000) |
-| T002 | `calculateBackoff` never returns zero or negative for any retryCount 0-10000 |
-| T003 | `classifyRefreshError` classifies "server not found" as terminal |
-| T004 | `classifyRefreshError` classifies "server does not use OAuth" as terminal |
-| T005 | `handleRefreshFailure` stops after max retries and emits failure event |
-| T006 | `handleRefreshFailure` treats terminal errors same as invalid_grant (immediate stop) |
-| T007 | `rescheduleAfterDelay` enforces minimum delay |
-| T008 | Integration: server-not-found error stops refresh immediately (no retry) |