57 changes: 31 additions & 26 deletions design/llm-gateway.md

- Single chokepoint for every LLM call, regardless of caller.
- Per-tier concurrency caps (Low / Balanced / High treated independently).
- Cancellation propagates end-to-end: pending and in-flight calls abort when their `ct` fires.
- User-facing calls effectively preempt background calls without a priority queue.
- Centralized observability: queue depth, slot wait time, in-flight count.
- Bounded queue depth: the system fails fast under sustained rate limiting rather than piling up indefinitely.

## Non-goals
- Replacing the SDK. The gateway wraps the SDK; it does not reimplement provider clients.
- Per-call routing across providers. Tier selection remains the caller's concern; the gateway only mediates calls within whatever tier the caller chose.
- Solving observation-framework parallelism by itself. The gateway is a prerequisite, not a replacement.
- **Owning rate-limit retry.** The OpenAI SDK's `ClientRetryPolicy` already does this well — `Retry-After` honoring, exponential backoff with jitter, and retries for 429s, 5xx, and transient network errors. Reimplementing it in the gateway was attempted in an earlier draft of this design and rejected: the gateway version was less mature (no jitter, no 5xx, no transient-error coverage) and added code without meaningful benefit beyond the metrics. The SDK retry stays enabled. See "Retry policy" below for the consequences.

## Architecture

Every LLM call passes through the gateway. The gateway holds per-tier `SemaphoreSlim` instances; rate-limit retry happens *inside* the slot, in the SDK pipeline.

```
Caller (handler / dream phase / subagent)
                         │
                         ▼
┌────────────────────────────────────────────────┐
│ LlmGateway                                     │
│                                                │
│   await _tierSemaphores[tier].WaitAsync(ct);   │
│   try {                                        │
│       // metrics: slot wait, in-flight, queue  │
│       //          depth, latency               │
│       // SDK pipeline handles 429/5xx retry    │
│       //    internally; ct propagates          │
│   } finally {                                  │
│       _tierSemaphores[tier].Release();         │
│   }                                            │
└────────────────────────┬───────────────────────┘
                         │
                         ▼
        Provider SDK (with built-in retry policy)
```

The gateway is the only direct consumer of the provider SDK. `ILlmClient` implementations that bypass it must be removed. The slot is held during any SDK-internal retry waits — releasing the slot during a retry would not help, since rate limits are per-tier so any other call in the same tier would hit the same limit.
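
A minimal sketch of that slot path is below. `LlmGateway`, `_tierSemaphores`, and `MaxPendingPerTier` are names from this design; `ExecuteAsync`, `LlmTier`, and the metric hook are illustrative placeholders rather than the real signatures, and the caller is assumed to hand the gateway its SDK invocation as a delegate.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public enum LlmTier { Low, Balanced, High }

public sealed class LlmGateway
{
    private readonly Dictionary<LlmTier, SemaphoreSlim> _tierSemaphores;
    private readonly Dictionary<LlmTier, int> _maxPendingPerTier;
    private readonly ConcurrentDictionary<LlmTier, int> _pending = new();

    public LlmGateway(IReadOnlyDictionary<LlmTier, int> concurrencyCaps,
                      IReadOnlyDictionary<LlmTier, int> maxPendingPerTier)
    {
        _tierSemaphores = concurrencyCaps.ToDictionary(
            kv => kv.Key, kv => new SemaphoreSlim(kv.Value, kv.Value));
        _maxPendingPerTier = maxPendingPerTier.ToDictionary(kv => kv.Key, kv => kv.Value);
    }

    public async Task<T> ExecuteAsync<T>(LlmTier tier,
                                         Func<CancellationToken, Task<T>> sdkCall,
                                         CancellationToken ct)
    {
        // Bounded queue: fail fast rather than piling up behind a rate-limited tier.
        if (_pending.AddOrUpdate(tier, 1, (_, n) => n + 1) > _maxPendingPerTier[tier])
        {
            _pending.AddOrUpdate(tier, 0, (_, n) => n - 1);
            throw new InvalidOperationException($"Too many pending {tier}-tier LLM calls.");
        }

        long waitStart = Stopwatch.GetTimestamp();
        try
        {
            await _tierSemaphores[tier].WaitAsync(ct); // a pending call aborts here when ct fires
        }
        finally
        {
            _pending.AddOrUpdate(tier, 0, (_, n) => n - 1);
        }

        try
        {
            RecordSlotWait(tier, Stopwatch.GetElapsedTime(waitStart)); // slot-wait metric
            // The SDK pipeline retries 429/5xx internally while this slot is held;
            // ct flows into the pipeline so in-flight calls and retry waits abort too.
            return await sdkCall(ct);
        }
        finally
        {
            _tierSemaphores[tier].Release();
        }
    }

    private static void RecordSlotWait(LlmTier tier, TimeSpan waited)
    {
        // Placeholder: forward to the existing telemetry pipeline.
    }
}
```

A caller wraps its SDK invocation in the delegate and passes its own `ct` through, so cancellation reaches both the semaphore wait and the SDK pipeline.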

## Cancellation, not priority lanes


## Retry policy

The OpenAI SDK's `ClientRetryPolicy` (the default in `System.ClientModel`) owns retry. It already provides:

- `Retry-After` header honoring on 429 and 503 responses
- Exponential backoff with jitter when no header is supplied
- Retries on 408, 429, 500, 502, 503, 504, and transient network errors
- Configurable max attempts (default 3)
- Cancellation propagation through the pipeline

This is more thorough than what the gateway could reasonably reimplement. The gateway therefore does *not* wrap retry. The SDK pipeline runs retry inside the gateway slot, which means:

- Retry waits hold the slot. As discussed above, releasing during a retry wait would not help under per-tier rate limiting.
- Cancellation reaches retry waits through the standard `ct` propagation. The gateway does not need its own retry-cancellation logic.
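
In configuration terms, "keep the SDK retry enabled" means not replacing `RetryPolicy` with a no-op. A sketch, assuming the OpenAI .NET client whose options derive from `ClientPipelineOptions`; the explicit `maxRetries` value is illustrative, since leaving `RetryPolicy` untouched gives the same default behavior:

```csharp
using System;
using System.ClientModel;
using System.ClientModel.Primitives;
using OpenAI;

string apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY")!;

// Keep the SDK's ClientRetryPolicy; do not swap in a no-retry policy.
// The gateway acquires the tier slot first, then this client's pipeline runs its
// Retry-After / backoff logic inside that slot, with ct flowing through.
var options = new OpenAIClientOptions
{
    RetryPolicy = new ClientRetryPolicy(maxRetries: 3), // 3 is the SDK default
};

var client = new OpenAIClient(new ApiKeyCredential(apiKey), options);
```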

**Implication for cross-provider observability:** the gateway does not see individual retry events — they happen inside the SDK pipeline, below the gateway's instrumentation. If retry counts and retry-after sources become important enough to centralize, they should be added by hooking the SDK pipeline (a `PipelinePolicy`-derived listener that forwards retry telemetry into `HostDiagnostics`), not by reimplementing retry above the SDK. This is recorded as a possible future enhancement.
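
If that enhancement is ever picked up, the listener might take roughly the following shape. This is a sketch only: it assumes the per-try pipeline position, and `IHostDiagnostics` with its `LlmAttempt` method stands in for whatever surface `HostDiagnostics` actually exposes.

```csharp
using System;
using System.ClientModel.Primitives;
using System.Collections.Generic;
using System.Threading.Tasks;

// Stand-in for the real diagnostics surface; the method name is made up.
public interface IHostDiagnostics
{
    void LlmAttempt(Uri? endpoint);
}

// Registered with options.AddPolicy(policy, PipelinePosition.PerTry), this runs once per
// attempt, i.e. after ClientRetryPolicy has decided to (re)send, so the diagnostics side
// can count attempts per request and treat anything past the first as a retry.
public sealed class RetryTelemetryPolicy : PipelinePolicy
{
    private readonly IHostDiagnostics _diagnostics;

    public RetryTelemetryPolicy(IHostDiagnostics diagnostics) => _diagnostics = diagnostics;

    public override void Process(PipelineMessage message,
                                 IReadOnlyList<PipelinePolicy> pipeline, int currentIndex)
    {
        _diagnostics.LlmAttempt(message.Request.Uri);
        ProcessNext(message, pipeline, currentIndex);
    }

    public override async ValueTask ProcessAsync(PipelineMessage message,
                                                 IReadOnlyList<PipelinePolicy> pipeline,
                                                 int currentIndex)
    {
        _diagnostics.LlmAttempt(message.Request.Uri);
        await ProcessNextAsync(message, pipeline, currentIndex).ConfigureAwait(false);
    }
}
```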

**Non-OpenAI providers.** `CopilotChatClient` has its own retry inside the client (configured via `LlmTierConfig.MaxRetries`). Other future providers will likewise own their own retry. The gateway is provider-agnostic and remains so.

## Cancellation discipline


## Implementation phases

1. **Wrap, don't replace.** The gateway is a thin layer in front of `ILlmClient` (or whatever the existing abstraction is). Existing call sites already go through that abstraction; the gateway is plumbed in behind it without touching call sites. ✅ Landed in #355.
2. **Add per-tier semaphores.** Done as part of phase 1 — `LlmGateway` with per-tier `SemaphoreSlim` and configurable caps. ✅
3. **Add gateway metrics.** Slot wait time, in-flight count, queue depth. Slot-wait metric landed in phase 1. ✅
4. **Add bounded queue.** `MaxPendingPerTier` with fail-fast. (Phase 3.)
5. **Audit ct propagation.** Walk every existing LLM call site, confirm `ct` flows. Fix any that pass `default`/`None`. Make `ct` mandatory at the `ILlmClient` boundary so future violations cannot be introduced (see the interface sketch after this list). (Phase 4.)
6. **Test cancellation contract.** Per-tier test that asserts cancellation aborts pending and in-flight calls promptly. Pending-side covered in phase 1; in-flight side completed in phase 4 alongside the audit.
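
The real `ILlmClient` shape is not reproduced in this document, so the following is a hypothetical illustration of the "mandatory `ct`" rule from phase 5 only:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shape; the real interface and its request/response types live elsewhere
// in the codebase. The point is the last parameter: no default value, so forgetting to
// pass a CancellationToken is a compile error rather than a silent CancellationToken.None.
public interface ILlmClient
{
    Task<LlmResponse> CompleteAsync(LlmRequest request, CancellationToken cancellationToken);
}

public sealed record LlmRequest(string Prompt);
public sealed record LlmResponse(string Text);
```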

A previously-planned phase to wrap rate-limit retry inside the gateway was attempted and reverted (see "Retry policy" and "Non-goals"). The OpenAI SDK's retry is more capable than what the gateway would build; we keep it enabled and let it run inside the gateway slot.

Phase 5 (the `ct`-propagation audit) is the one that has follow-on work elsewhere in the codebase. Each violation is small, but they all have to be fixed before the cancellation guarantee holds.
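
As an illustration of the kind of small violation the audit targets, reusing the hypothetical `ILlmClient` sketched above; `SummariseHandler` is likewise made up:

```csharp
using System.Threading;
using System.Threading.Tasks;

public sealed class SummariseHandler
{
    private readonly ILlmClient _llm;

    public SummariseHandler(ILlmClient llm) => _llm = llm;

    public async Task<string> HandleAsync(LlmRequest request, CancellationToken ct)
    {
        // Violation the audit looks for (do not do this):
        //   await _llm.CompleteAsync(request, CancellationToken.None);

        // Correct: the handler's ct flows through, so cancelling the work item also
        // aborts this call whether it is still queued at the gateway or in flight.
        var response = await _llm.CompleteAsync(request, ct);
        return response.Text;
    }
}
```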

## Open questions

- **Cap defaults.** The 8/4/2 split above is a guess. Real workload will inform tuning. Should be config-overridable.
- **Cross-process coordination.** If multiple agent processes run against the same provider account, per-process caps undercount real concurrency. Out of scope for v1, but worth noting as a future concern (Redis-backed gateway? provider-side rate-limit tracking?).
- **Streaming responses.** Do any current call sites use streaming? If so, "in-flight" measurement and cancellation behave differently. Confirm during the audit pass.
- **Subagent message-bus boundary.** Subagent calls are dispatched via `IMessagePublisher`, which means `ct` does not naturally cross the bus. The receiving handler establishes a fresh `ct` from its own work-serializer slot, which is the right behavior — but worth confirming that no LLM calls inside a subagent handler accidentally fall back to `None` because the original ct didn't survive the boundary.
- **Retry budget per call vs. per cycle.** Should there be a global budget on retries-per-dream-cycle so that a degraded provider doesn't burn the whole cycle on a single phase's retries? Probably premature, but record as a thought.
- **Pipeline-level retry telemetry.** The SDK does retries inside the gateway slot but doesn't surface retry events to our diagnostics. If centralized retry visibility becomes important, add a `PipelinePolicy`-derived listener that forwards retry telemetry into `HostDiagnostics`. Probably premature; record as a thought.