
Revise LLM gateway design: SDK owns retry #357

Merged
rockfordlhotka merged 1 commit into main from design/llm-gateway-retry-decision
May 8, 2026

Conversation

@rockfordlhotka
Member

Summary

Updates design/llm-gateway.md to reflect the decision made after attempting phase 2 of #352: keep the OpenAI SDK's ClientRetryPolicy enabled and let the gateway focus on concurrency capping. PR #356 (the gateway-retry implementation) was closed without merging.

The SDK retry is more mature than what the gateway would reasonably reimplement (Retry-After honoring + jitter + 429/5xx/transient-network coverage). Reimplementing it traded maturity for a small observability gain that can be addressed later via a PipelinePolicy listener if it becomes important.
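To make the division of responsibilities concrete, here is a minimal sketch of the resulting shape (Python with hypothetical names; the actual gateway is presumably .NET wrapping the OpenAI SDK). The gateway only caps concurrency; any retrying happens inside the call it wraps, so a slot is held for the full duration of a request including its retries:

```python
import asyncio

class ConcurrencyCappedGateway:
    """Hypothetical sketch: caps in-flight LLM calls.

    Retry is NOT implemented here -- in the real design the SDK's
    ClientRetryPolicy retries inside the wrapped call, so the slot
    stays held across retries.
    """

    def __init__(self, max_concurrency: int):
        self._slots = asyncio.Semaphore(max_concurrency)

    async def send(self, call):
        # Slot is held for the entire call, including any retries
        # performed by the SDK pipeline inside `call`.
        async with self._slots:
            return await call()

async def demo():
    gateway = ConcurrencyCappedGateway(max_concurrency=2)

    async def fake_llm_call():
        await asyncio.sleep(0)  # stand-in for SDK pipeline + retry
        return "ok"

    return await asyncio.gather(
        *(gateway.send(fake_llm_call) for _ in range(5))
    )
```

This is the "slot-held-during-retry" behavior the design doc calls out: a retrying request occupies capacity for its whole lifetime, which is the intended trade-off.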

Changes

  • Goals/Non-goals: removes "honor Retry-After" and "bounded retry" goals; adds an explicit non-goal "owning rate-limit retry" with reasoning so the decision survives future drift
  • Architecture diagram: updated to show retry happening in the SDK pipeline inside the gateway slot, not in the gateway itself
  • Retry policy section: rewritten to describe the SDK as the retry owner; notes the slot-held-during-retry behavior and the cross-provider implications
  • Implementation phases: phase 1 ✅ marked complete; phases renumbered (phase 3 = bounded queue, phase 4 = ct audit + mandatory ct on ILlmClient); the abandoned retry phase explicitly recorded
  • Open questions: replaces "retry budget per cycle" with "pipeline-level retry telemetry" as the relevant future concern

Test plan

  • Read updated design/llm-gateway.md end-to-end and confirm the rationale for keeping SDK retry is captured clearly enough to deter future "let's add gateway retry again" drift

🤖 Generated with Claude Code

…ency

The original design called for the gateway to own rate-limit retry and for
the SDK's built-in retry to be disabled. Implementation in PR #356
revealed that the OpenAI SDK's ClientRetryPolicy is more capable than what
the gateway would reasonably build:

- Honors Retry-After on 429 and 503
- Exponential backoff with jitter (the gateway version had no jitter,
  risking thundering-herd retry on shared rate limits)
- Retries on 408, 429, 5xx, and transient network errors (the gateway
  version handled only 429)
- Maintained as providers change behavior

Reimplementing this in the gateway traded a more mature retry for a less
mature one, with no benefit beyond centralized metrics. PR #356 was closed
without merging.

Updates the design doc to reflect the decision:
- Retry stays in the SDK pipeline, runs inside the gateway slot
- Gateway focuses on concurrency capping, cancellation, and queue-depth
  telemetry — the load-bearing pieces
- Pipeline-level retry telemetry recorded as a future enhancement
- Implementation phases renumbered: phase 3 = bounded queue, phase 4 =
  ct audit + mandatory ct on ILlmClient
- Non-goals section explicitly records "owning rate-limit retry" as
  rejected, with reasoning, so this decision survives future drift
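If pipeline-level retry telemetry is picked up later, the shape could be as simple as an observer that records each attempt without owning the retry decision (hypothetical Python sketch; in .NET this would be a PipelinePolicy inserted into the SDK pipeline):

```python
class RetryObserver:
    """Counts attempts per logical request; never decides retry policy."""

    def __init__(self):
        self.attempts = 0

    def record_attempt(self):
        self.attempts += 1

def send_with_observer(pipeline_call, observer, max_attempts=3):
    # The retry loop here stands in for the SDK pipeline; the observer
    # only records telemetry, it never chooses whether to retry.
    last_error = None
    for _ in range(max_attempts):
        observer.record_attempt()
        try:
            return pipeline_call()
        except ConnectionError as exc:
            last_error = exc
    raise last_error
```

The point of the split: telemetry is additive and can be bolted on later, whereas owning the retry loop is what PR #356 showed to be a losing trade.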

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rockfordlhotka rockfordlhotka merged commit 1801cb4 into main May 8, 2026
@rockfordlhotka rockfordlhotka deleted the design/llm-gateway-retry-decision branch May 8, 2026 01:18