
LLM gateway: rate-limit retry + disable SDK retry (phase 2 of #352) #356

Closed
rockfordlhotka wants to merge 1 commit into main from feature/llm-gateway-retry

Conversation

@rockfordlhotka
Member

Second of four phases implementing design/llm-gateway.md. Adds retry-on-429 to the gateway and disables the OpenAI SDK's built-in retry so the two layers don't fight.

Builds on phase 1 (#355).

Summary

Gateway side

  • LlmGatewayOptions.MaxRateLimitRetries (default 5) and MaxBackoffSeconds (default 16)
  • ILlmRateLimitClassifier strategy with DefaultLlmRateLimitClassifier recognizing ClientResultException status 429 (with Retry-After header parsing — both numeric-seconds and HTTP-date forms) and HttpRequestException with TooManyRequests status. Walks the inner-exception chain for wrapped errors.
  • LlmGateway.ExecuteWithRetryAsync wraps the operation: on a recognized rate limit it honors Retry-After when present, falling back to exponential backoff (1s, 2s, 4s, ...) capped at MaxBackoffSeconds. The slot is held throughout (releasing it wouldn't help — rate limits are per-tier). The sleep is ct-aware, so cancellation aborts the retry wait. A sketch of the loop follows this list.
  • New metric rockbot.llm.gateway.rate_limit_retries (counter) tagged by tier and retry_after_source ("header" vs "backoff").
  • New constructor validation: MaxRateLimitRetries >= 0, MaxBackoffSeconds >= 1.
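
For readers without the diff, here is a minimal sketch of the retry shape described above. Names the PR itself uses (ILlmRateLimitClassifier, ExecuteWithRetryAsync, MaxRateLimitRetries, MaxBackoffSeconds) are real; the signatures, the TryClassify shape, and the LlmGatewayRetrySketch wrapper are assumptions for illustration, not the actual code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: the real LlmGateway also manages tier slots and metrics.
public interface ILlmRateLimitClassifier
{
    // True when ex (walking the inner-exception chain) is a rate limit;
    // retryAfter carries the parsed Retry-After value when a header was present.
    bool TryClassify(Exception ex, out TimeSpan? retryAfter);
}

public sealed class LlmGatewayRetrySketch
{
    private readonly ILlmRateLimitClassifier _classifier;
    private readonly int _maxRateLimitRetries;   // default 5
    private readonly int _maxBackoffSeconds;     // default 16

    public LlmGatewayRetrySketch(
        ILlmRateLimitClassifier classifier,
        int maxRateLimitRetries = 5,
        int maxBackoffSeconds = 16)
    {
        // Mirrors the constructor validation the PR describes.
        if (maxRateLimitRetries < 0) throw new ArgumentOutOfRangeException(nameof(maxRateLimitRetries));
        if (maxBackoffSeconds < 1) throw new ArgumentOutOfRangeException(nameof(maxBackoffSeconds));
        _classifier = classifier;
        _maxRateLimitRetries = maxRateLimitRetries;
        _maxBackoffSeconds = maxBackoffSeconds;
    }

    public async Task<T> ExecuteWithRetryAsync<T>(
        Func<CancellationToken, Task<T>> operation, CancellationToken ct)
    {
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                return await operation(ct);
            }
            catch (Exception ex) when (
                attempt < _maxRateLimitRetries &&
                _classifier.TryClassify(ex, out var retryAfter))
            {
                // Honor Retry-After when the server supplied one; otherwise
                // exponential backoff (1s, 2s, 4s, ...) capped at the maximum.
                var delay = retryAfter
                    ?? TimeSpan.FromSeconds(Math.Min(Math.Pow(2, attempt), _maxBackoffSeconds));

                // ct-aware sleep: cancelling the token aborts the retry wait,
                // not just the in-flight call.
                await Task.Delay(delay, ct);
            }
        }
    }
}
```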

SDK side

  • RetryPolicy = new ClientRetryPolicy(maxRetries: 0) added to every OpenAIClient construction in agent Program.cs files (RockBot.Agent, RockBot.ResearchAgent, RockBot.SampleAgent). The Sample HTTP agent and the embedding client are intentionally untouched — neither routes through the gateway.
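
Concretely, each construction site gains roughly the following. This is a sketch: the credential, endpoint, and DI wiring in the real Program.cs files will differ, and apiKey here is a stand-in.

```csharp
using System;
using System.ClientModel;
using System.ClientModel.Primitives;
using OpenAI;

string apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY")
    ?? throw new InvalidOperationException("OPENAI_API_KEY not set");

var client = new OpenAIClient(
    new ApiKeyCredential(apiKey),
    new OpenAIClientOptions
    {
        // The gateway owns rate-limit retry; zeroing the SDK's built-in
        // retry keeps the two layers from double-retrying on 429.
        RetryPolicy = new ClientRetryPolicy(maxRetries: 0)
    });
```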

Tests

  • 8 new gateway retry tests (LlmGatewayTests.cs): success-after-retry, exhaustion, Retry-After honoring, exponential backoff, cancel-during-wait, non-rate-limit-not-retried, plus options validation; a representative shape is sketched after this list
  • 5 new classifier tests (DefaultLlmRateLimitClassifierTests.cs): positive/negative HttpRequestException paths, inner-exception walking, AggregateException-collection-not-walked (documenting current behavior)
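
As a rough illustration of the success-after-retry case, here is a plausible shape using the sketch types above and the FakeRateLimitClassifier stub the commit message mentions (xUnit is assumed; the real tests in LlmGatewayTests.cs exercise the actual gateway):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Xunit;

public class LlmGatewayRetrySketchTests
{
    // Stub: treats InvalidOperationException as a rate limit and asks for
    // a zero wait so the test never actually sleeps.
    private sealed class FakeRateLimitClassifier : ILlmRateLimitClassifier
    {
        public bool TryClassify(Exception ex, out TimeSpan? retryAfter)
        {
            retryAfter = TimeSpan.Zero;
            return ex is InvalidOperationException;
        }
    }

    [Fact]
    public async Task Succeeds_after_transient_rate_limits()
    {
        var gateway = new LlmGatewayRetrySketch(new FakeRateLimitClassifier());
        var calls = 0;

        var result = await gateway.ExecuteWithRetryAsync<string>(
            _ => ++calls < 3
                ? throw new InvalidOperationException("429")
                : Task.FromResult("ok"),
            CancellationToken.None);

        Assert.Equal("ok", result);
        Assert.Equal(3, calls);   // two rate-limited attempts, then success
    }
}
```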

Test plan

  • dotnet build RockBot.slnx — clean
  • dotnet test --filter "FullyQualifiedName~LlmGatewayTests|FullyQualifiedName~DefaultLlmRateLimitClassifierTests" — 21/21 pass
  • Full Host suite — 637/637 pass
  • Full solution test run — all projects pass

Out of scope (later phases)

  • Phase 3: bounded queue depth (MaxPendingPerTier) with fail-fast
  • Phase 4: ct propagation audit + make ct mandatory on ILlmClient

🤖 Generated with Claude Code

Adds retry-on-429 to the gateway with Retry-After honoring and
exponential-backoff fallback. Disables the OpenAI SDK's built-in retry
in every chat-client construction site so gateway and SDK don't
double-retry silently on rate-limit responses.

Phase 2 scope per design/llm-gateway.md:

Gateway side:
- LlmGatewayOptions adds MaxRateLimitRetries (default 5) and
  MaxBackoffSeconds (default 16)
- ILlmRateLimitClassifier strategy with DefaultLlmRateLimitClassifier
  recognizing ClientResultException status 429 (with Retry-After header
  parsing) and HttpRequestException with TooManyRequests status
- LlmGateway.ExecuteWithRetryAsync wraps the operation: on a classifier-
  recognized rate limit, honors Retry-After when supplied or falls back
  to exponential backoff (1s, 2s, 4s, ...) capped at MaxBackoffSeconds.
  Slot is held throughout (releasing wouldn't help since rate limits
  are per-tier). Sleep is ct-aware so cancellation aborts the retry.
- New metric: rockbot.llm.gateway.rate_limit_retries (counter), tagged
  by tier and retry-after source ("header" vs "backoff")
- Constructor validates MaxRateLimitRetries >= 0 and MaxBackoffSeconds >= 1

SDK side:
- Adds RetryPolicy = new ClientRetryPolicy(maxRetries: 0) to every
  OpenAIClient construction in agent Program.cs files (Agent,
  ResearchAgent, SampleAgent). The Sample HTTP agent and the embedding
  client are intentionally not modified — they don't route through the
  gateway.

Tests:
- 8 new gateway retry tests (success-after-retry, exhaustion, Retry-
  After honoring, exponential backoff, cancel-during-wait, non-rate-
  limit-not-retried) using a stubbed FakeRateLimitClassifier
- 5 new DefaultLlmRateLimitClassifier tests covering positive/negative
  HttpRequestException paths and inner-exception walking
- All 21 gateway+classifier unit tests pass; all 637 Host tests pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rockfordlhotka
Member Author

Closing without merging — see thread on the design tradeoff. Decision: re-enable the OpenAI SDK retry (more mature than the gateway retry we'd build: jitter, 5xx, transient network errors) and let the gateway focus on its load-bearing role of concurrency capping. The design doc and #352 will be updated accordingly. The phase 2 gateway-retry code is discarded with this PR; the SDK retry was never actually disabled in main since #355 only landed phase 1.

@rockfordlhotka rockfordlhotka deleted the feature/llm-gateway-retry branch May 8, 2026 00:53
rockfordlhotka added a commit that referenced this pull request May 8, 2026
…ency (#357)

The original design called for the gateway to own rate-limit retry and for
the SDK's built-in retry to be disabled. Implementation in PR #356
revealed that the OpenAI SDK's ClientRetryPolicy is more capable than what
the gateway would reasonably build:

- Honors Retry-After on 429 and 503
- Exponential backoff with jitter (the gateway version had no jitter,
  risking thundering-herd retry on shared rate limits)
- Retries on 408, 429, 5xx, and transient network errors (the gateway
  version handled only 429)
- Maintained as providers change behavior

Reimplementing this in the gateway traded a more mature retry for a less
mature one, with no benefit beyond centralized metrics. PR #356 was closed
without merging.

Updates the design doc to reflect the decision:
- Retry stays in the SDK pipeline, runs inside the gateway slot
- Gateway focuses on concurrency capping, cancellation, and queue-depth
  telemetry — the load-bearing pieces
- Pipeline-level retry telemetry recorded as a future enhancement
- Implementation phases renumbered: phase 3 = bounded queue, phase 4 =
  ct audit + mandatory ct on ILlmClient
- Non-goals section explicitly records "owning rate-limit retry" as
  rejected, with reasoning, so this decision survives future drift

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
