Implements the LLM gateway design (`design/llm-gateway.md`).

Adds a global chokepoint for every LLM call: per-tier concurrency caps, centralized retry on 429 (`Retry-After`-aware, with exponential fallback), end-to-end cancellation, and centralized metrics. Trigger: parallelism in upcoming work (e.g. the observation framework) makes the current ad-hoc rate-limit story untenable.
## Scope

### Wrap, don't replace
- Thin gateway layer in front of `ILlmClient`. Existing call sites are unchanged at the signature level; the `ct` propagation audit is the only call-site change required.
- The gateway becomes the provider SDK's only direct consumer.
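The wrap-don't-replace shape can be sketched as follows. This is illustrative only: apart from `ILlmClient`, every type and member name here (`Tier`, `LlmRequest`, `LlmResponse`, `LlmGateway`) is an assumption, not the actual codebase.

```csharp
using System.Threading;
using System.Threading.Tasks;

public enum Tier { Low, Balanced, High }

// Hypothetical request/response shapes, for illustration only.
public sealed record LlmRequest(string Prompt, Tier Tier);
public sealed record LlmResponse(string Text);

public interface ILlmClient
{
    // ct is mandatory: no overload without a CancellationToken.
    Task<LlmResponse> CompleteAsync(LlmRequest request, CancellationToken ct);
}

// The gateway presents the same surface as ILlmClient, so existing call
// sites compile unchanged; only the gateway talks to the provider SDK.
public sealed class LlmGateway : ILlmClient
{
    private readonly ILlmClient _sdk; // adapter over the provider SDK

    public LlmGateway(ILlmClient sdk) => _sdk = sdk;

    public Task<LlmResponse> CompleteAsync(LlmRequest request, CancellationToken ct)
        => _sdk.CompleteAsync(request, ct); // caps, retry, metrics hook in here
}
```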
### Per-tier concurrency caps
- One `SemaphoreSlim` per tier (Low / Balanced / High).
- Initial caps configurable; suggested defaults 8 / 4 / 2.
- Caps are per-process. Cross-process coordination is explicitly out of scope (see open questions in the design doc).
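A minimal sketch of the per-tier gating, assuming the hypothetical `Tier` enum and a constructor-injected cap config (the `TierGate` name and shape are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public enum Tier { Low, Balanced, High }

public sealed class TierGate
{
    private readonly Dictionary<Tier, SemaphoreSlim> _slots;

    public TierGate(int low = 8, int balanced = 4, int high = 2) // suggested defaults
    {
        _slots = new Dictionary<Tier, SemaphoreSlim>
        {
            [Tier.Low]      = new SemaphoreSlim(low, low),
            [Tier.Balanced] = new SemaphoreSlim(balanced, balanced),
            [Tier.High]     = new SemaphoreSlim(high, high),
        };
    }

    public async Task<T> RunAsync<T>(
        Tier tier, Func<CancellationToken, Task<T>> call, CancellationToken ct)
    {
        SemaphoreSlim slot = _slots[tier];
        await slot.WaitAsync(ct);   // queued wait is ct-aware
        try { return await call(ct); }
        finally { slot.Release(); } // always free the tier slot
    }
}
```

The caps are plain process-local `SemaphoreSlim` counts, consistent with cross-process coordination being out of scope.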
### Retry on 429
- Honor `Retry-After` when present; fall back to exponential backoff (1s, 2s, 4s, 8s, capped at 16s).
- All sleeps are `ct`-aware (`Task.Delay(_, ct)`).
- The tier slot is held during retry waits.
- Configurable max attempts (default 5); after that, surface the failure to the caller.
- Provider SDK retry MUST be disabled — gateway and SDK both retrying causes silent double-retry, with each gateway attempt fanning out into the SDK's own retries.
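The retry loop above can be sketched roughly as below. `RateLimitException` and its `RetryAfter` property are assumptions standing in for however the provider SDK surfaces a 429 and its `Retry-After` header:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class RateLimitException : Exception
{
    public TimeSpan? RetryAfter { get; init; } // parsed Retry-After header, if any
}

public static class GatewayRetry
{
    public static async Task<T> RetryOn429Async<T>(
        Func<CancellationToken, Task<T>> call,
        CancellationToken ct,
        int maxAttempts = 5) // configurable; default 5
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await call(ct);
            }
            catch (RateLimitException ex) when (attempt < maxAttempts)
            {
                // Honor Retry-After when present; else 1s, 2s, 4s, 8s, capped at 16s.
                var backoff = TimeSpan.FromSeconds(Math.Min(Math.Pow(2, attempt - 1), 16));
                await Task.Delay(ex.RetryAfter ?? backoff, ct); // ct-aware sleep
            }
            // Past maxAttempts the exception propagates: failure surfaces to the caller.
        }
    }
}
```

Note the loop runs inside the tier slot, so the slot stays held during retry waits as specified.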
### Cancellation discipline
- `ct` is mandatory at the gateway boundary (no default overload).
- Audit every existing LLM call site to confirm `ct` flows through; fix any that pass `CancellationToken.None`.
- Test that cancelling a `ct` during a simulated long call aborts within a bounded time, per tier.
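The bounded-abort test might look like the sketch below, assuming an xUnit harness and a hypothetical `StubLlmClient` that hangs until its token fires; none of these test-side names exist in the source:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using Xunit;

public sealed class CancellationTests
{
    [Fact]
    public async Task Cancelling_ct_during_long_call_aborts_within_bound()
    {
        // Stub inner client that only completes when its ct fires.
        var hangingSdk = new StubLlmClient(async (req, ct) =>
        {
            await Task.Delay(Timeout.Infinite, ct); // throws on cancellation
            return new LlmResponse("");
        });
        var gateway = new LlmGateway(hangingSdk);
        using var cts = new CancellationTokenSource(millisecondsDelay: 50);

        var sw = Stopwatch.StartNew();
        await Assert.ThrowsAnyAsync<OperationCanceledException>(
            () => gateway.CompleteAsync(new LlmRequest("p", Tier.Balanced), cts.Token));

        // "Bounded time": well under any provider timeout. Repeat per tier.
        Assert.True(sw.Elapsed < TimeSpan.FromSeconds(1));
    }
}
```

The same shape covers both the queued-wait and in-flight cases, since `WaitAsync(ct)` and the stubbed SDK call each observe the token.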
### Bounded queue depth
- `MaxPendingPerTier` config. When exceeded, fail fast with a "gateway saturated" error — caller decides whether to skip, defer, or retry.
- Shed-oldest is explicitly out of scope for v1.
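One way to sketch the fail-fast admission check, as a fragment of the hypothetical gateway (`GatewaySaturatedException`, `_pending`, `_maxPendingPerTier`, `_slot`, and `_sdk` are all illustrative names):

```csharp
public sealed class GatewaySaturatedException : Exception
{
    public GatewaySaturatedException(Tier tier)
        : base($"LLM gateway saturated for tier {tier}") { }
}

// Fragment of the gateway; in a real implementation _pending, _slot, and the
// cap would be tracked per tier.
private int _pending; // callers currently queued on the tier slot

public async Task<LlmResponse> CompleteAsync(LlmRequest request, CancellationToken ct)
{
    if (Interlocked.Increment(ref _pending) > _maxPendingPerTier)
    {
        Interlocked.Decrement(ref _pending);
        // Fail fast: caller decides whether to skip, defer, or retry.
        throw new GatewaySaturatedException(request.Tier);
    }
    try
    {
        await _slot.WaitAsync(ct); // queued wait counts against MaxPendingPerTier
    }
    finally
    {
        Interlocked.Decrement(ref _pending); // no longer pending once admitted
    }
    try { return await _sdk.CompleteAsync(request, ct); }
    finally { _slot.Release(); }
}
```

Throwing rather than blocking keeps the "saturated" signal synchronous and cheap, matching the no-shed-oldest decision for v1.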
### Centralized metrics
- Per call: tier, latency (semaphore wait / in-flight / total), token counts (in / out / cached), cost, retry count, outcome.
- Per tier: queue depth, in-flight count, slots free, rolling RPM/TPM.
- Routed through the existing telemetry pipeline.
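The per-call payload could be carried as a simple record before being handed to the telemetry pipeline; the field names below are assumptions mapped from the list above, not the pipeline's actual schema:

```csharp
using System;

public sealed record LlmCallMetrics(
    Tier Tier,
    TimeSpan SemaphoreWait,  // time queued on the tier slot
    TimeSpan InFlight,       // time inside the provider SDK call
    TimeSpan Total,
    int TokensIn,
    int TokensOut,
    int TokensCached,
    decimal Cost,
    int RetryCount,
    string Outcome);         // e.g. "ok", "canceled", "rate_limited", "saturated"
```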
## Out of scope
- Priority queues / priority lanes. User-priority semantics are achieved via cancellation; see the design doc for why this is sufficient.
- Cross-process / cross-replica concurrency coordination.
- Replacing `ILlmClient` or moving routing logic into the gateway.
- Streaming-response handling beyond ensuring cancellation works (the audit pass should confirm whether any current callers stream).
## Acceptance
- 429 with `Retry-After` waits the indicated duration; 429 without `Retry-After` falls back to exponential backoff.
- Cancelling a `ct` during a queued wait aborts the call promptly (test per tier).
- Cancelling a `ct` during an in-flight call aborts the SDK request (test per tier).
- `MaxPendingPerTier` exceeded → caller receives the "saturated" error rather than blocking indefinitely.
- Every call site passes a real `ct`. No `CancellationToken.None`.

## Design reference
`design/llm-gateway.md`.