
LLM gateway: per-tier concurrency cap, retry, ct propagation #352

@rockfordlhotka

Description


Implements the LLM gateway design (design/llm-gateway.md).

Adds a global chokepoint for every LLM call: per-tier concurrency caps, centralized retry on 429 (Retry-After-aware, exponential fallback), end-to-end cancellation, and centralized metrics. Trigger: parallelism in upcoming work (e.g. observation framework) makes the current ad-hoc rate-limit story untenable.

Scope

Wrap, don't replace

  • Thin gateway layer in front of ILlmClient. Existing call sites unchanged at signature level — ct propagation audit is the only call-site change required.
  • Provider SDK becomes the gateway's only direct consumer.

Per-tier concurrency caps

  • One SemaphoreSlim per tier (Low / Balanced / High).
  • Initial caps configurable; suggested defaults 8 / 4 / 2.
  • Caps are per-process. Cross-process coordination explicitly out of scope (see open questions in the design doc).

Retry on 429

  • Honor Retry-After when present; fall back to exponential backoff (1s, 2s, 4s, 8s, capped at 16s).
  • All sleeps are ct-aware (Task.Delay(_, ct)).
  • Slot is held during retry waits.
  • Configurable max attempts (default 5); after that, surface the failure to the caller.
  • Provider SDK retry MUST be disabled — if both the gateway and the SDK retry, failures are silently retried twice and the backoff waits compound.
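
The delay schedule described above can be sketched as a pure function. This is an illustrative Python sketch, not the actual C# code; `backoff_seconds` is a hypothetical name. In the real gateway the resulting delay would be awaited ct-aware (`Task.Delay(_, ct)`) while the tier slot is held.

```python
from typing import Optional

def backoff_seconds(attempt: int, retry_after: Optional[float] = None) -> float:
    """Delay before retry number `attempt` (1-based) after a 429 response.

    Honors a server-provided Retry-After when present; otherwise falls back
    to exponential backoff: 1s, 2s, 4s, 8s, capped at 16s.
    """
    if retry_after is not None:
        return retry_after                  # server-specified wait wins
    return min(2 ** (attempt - 1), 16)      # 1, 2, 4, 8, 16, 16, ...
```

A caller would loop up to the configured max attempts (default 5) and surface the failure once the budget is exhausted.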

Cancellation discipline

  • ct is mandatory at the gateway boundary (no default overload).
  • Audit every existing LLM call site to confirm ct flows through; fix any that pass None.
  • Test that cancelling a ct during a simulated long call aborts within a bounded time, per tier.
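
The queued-wait cancellation requirement can be demonstrated in miniature. This Python sketch (illustrative only; the .NET equivalent is `SemaphoreSlim.WaitAsync(ct)`) shows a caller that is waiting for a slot being cancelled promptly instead of blocking until a slot frees.

```python
import asyncio

async def demo() -> str:
    sem = asyncio.Semaphore(1)
    await sem.acquire()                      # slot is taken; the next caller queues
    waiter = asyncio.create_task(sem.acquire())
    await asyncio.sleep(0)                   # let the waiter start queueing
    waiter.cancel()                          # analogous to cancelling the ct
    try:
        await waiter
    except asyncio.CancelledError:
        return "cancelled while queued"
    return "acquired"
```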

Bounded queue depth

  • MaxPendingPerTier config. When exceeded, fail fast with a "gateway saturated" error — the caller decides whether to skip, defer, or retry.
  • Shed-oldest is explicitly out of scope for v1.
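
The fail-fast admission check can be sketched as below. All names here (`BoundedTier`, `GatewaySaturatedError`, `admit`) are hypothetical; only the behavior — reject when the tier's queue is already at `MaxPendingPerTier` rather than queueing indefinitely — comes from this issue.

```python
class GatewaySaturatedError(Exception):
    """Raised instead of queueing when a tier's pending queue is full."""

class BoundedTier:
    """Admission control for one tier: cap in-flight calls, bound the queue."""

    def __init__(self, cap: int, max_pending: int):
        self.cap = cap
        self.max_pending = max_pending
        self.in_flight = 0
        self.pending = 0

    def admit(self) -> bool:
        """Return True to run now, False to queue; raise when the queue is full."""
        if self.in_flight < self.cap:
            self.in_flight += 1
            return True
        if self.pending >= self.max_pending:
            raise GatewaySaturatedError("gateway saturated")
        self.pending += 1
        return False
```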

Centralized metrics

  • Per call: tier, latency (semaphore wait / in-flight / total), token counts (in / out / cached), cost, retry count, outcome.
  • Per tier: queue depth, in-flight count, slots free, rolling RPM/TPM.
  • Routed through the existing telemetry pipeline.
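
A per-call metrics record carrying the fields listed above might look like this. The record shape and field names are illustrative (the actual schema lives in the design doc); the latency split into semaphore-wait and in-flight components is from this issue.

```python
from dataclasses import dataclass

@dataclass
class LlmCallMetrics:
    tier: str
    wait_ms: float        # time spent waiting for a tier slot
    in_flight_ms: float   # time the provider call was in flight
    tokens_in: int
    tokens_out: int
    tokens_cached: int
    cost_usd: float
    retries: int
    outcome: str          # e.g. "ok" | "cancelled" | "rate_limited" | "error"

    @property
    def total_ms(self) -> float:
        return self.wait_ms + self.in_flight_ms
```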

Out of scope

  • Priority queues / priority lanes. User-priority semantics are achieved via cancellation — see the design doc for why this is sufficient.
  • Cross-process / cross-replica concurrency coordination.
  • Replacing ILlmClient or moving routing logic into the gateway.
  • Streaming-response handling beyond ensuring cancellation works (audit pass should confirm whether any current callers stream).

Acceptance

  • All LLM calls route through the gateway; no call site bypasses it.
  • SDK retry is disabled and confirmed via test.
  • Per-tier caps apply correctly; test with a synthetic load that exceeds cap and confirms serialization at the cap.
  • 429 with Retry-After waits the indicated duration; 429 without Retry-After falls back to exponential backoff.
  • Cancelling ct during a queued wait aborts the call promptly (test per tier).
  • Cancelling ct during an in-flight call aborts the SDK request (test per tier).
  • MaxPendingPerTier exceeded → caller receives "saturated" error rather than blocking indefinitely.
  • Metrics emitted per call as specified in the design doc.
  • Audit pass: every LLM call site flows ct. No CancellationToken.None.

Design reference

design/llm-gateway.md.
