
LLM gateway: per-tier concurrency cap, retry, ct propagation #352

@rockfordlhotka

Description


Implements the LLM gateway design (design/llm-gateway.md).

Adds a global chokepoint for every LLM call: per-tier concurrency caps, centralized retry on 429 (Retry-After-aware, exponential fallback), end-to-end cancellation, and centralized metrics. Trigger: parallelism in upcoming work (e.g. observation framework) makes the current ad-hoc rate-limit story untenable.

Scope

Wrap, don't replace

  • Thin gateway layer in front of ILlmClient. Existing call sites unchanged at signature level — ct propagation audit is the only call-site change required.
  • Provider SDK becomes the gateway's only direct consumer.

Per-tier concurrency caps

  • One SemaphoreSlim per tier (Low / Balanced / High).
  • Initial caps configurable; suggested defaults 8 / 4 / 2.
  • Caps are per-process. Cross-process coordination explicitly out of scope (see open questions in the design doc).

Retry on 429

  • Honor Retry-After when present; fall back to exponential backoff (1s, 2s, 4s, 8s, capped at 16s).
  • All sleeps are ct-aware (Task.Delay(_, ct)).
  • Slot is held during retry waits.
  • Configurable max attempts (default 5); after that, surface the failure to the caller.
  • Provider SDK retry MUST be disabled — if both the gateway and the SDK retry, failures are silently retried twice and the backoff waits compound.
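
The delay schedule described above can be sketched as a pure function. This is an illustrative Python sketch, not the actual C# code; `backoff_seconds` is a hypothetical name. In the real gateway the resulting delay would be awaited ct-aware (`Task.Delay(_, ct)`) while the tier slot is held.

```python
from typing import Optional

def backoff_seconds(attempt: int, retry_after: Optional[float] = None) -> float:
    """Delay before retry number `attempt` (1-based) after a 429 response.

    Honors a server-provided Retry-After when present; otherwise falls back
    to exponential backoff: 1s, 2s, 4s, 8s, capped at 16s.
    """
    if retry_after is not None:
        return retry_after                  # server-specified wait wins
    return min(2 ** (attempt - 1), 16)      # 1, 2, 4, 8, 16, 16, ...
```

A caller would loop up to the configured max attempts (default 5) and surface the failure once the budget is exhausted.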

Cancellation discipline

  • ct is mandatory at the gateway boundary (no default overload).
  • Audit every existing LLM call site to confirm ct flows through; fix any that pass None.
  • Test that cancelling a ct during a simulated long call aborts within a bounded time, per tier.
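
The queued-wait cancellation requirement can be demonstrated in miniature. This Python sketch (illustrative only; the .NET equivalent is `SemaphoreSlim.WaitAsync(ct)`) shows a caller that is waiting for a slot being cancelled promptly instead of blocking until a slot frees.

```python
import asyncio

async def demo() -> str:
    sem = asyncio.Semaphore(1)
    await sem.acquire()                      # slot is taken; the next caller queues
    waiter = asyncio.create_task(sem.acquire())
    await asyncio.sleep(0)                   # let the waiter start queueing
    waiter.cancel()                          # analogous to cancelling the ct
    try:
        await waiter
    except asyncio.CancelledError:
        return "cancelled while queued"
    return "acquired"
```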

Bounded queue depth

  • MaxPendingPerTier config. When exceeded, fail fast with a "gateway saturated" error — the caller decides whether to skip, defer, or retry.
  • Shed-oldest is explicitly out of scope for v1.
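
The fail-fast admission check can be sketched as below. All names here (`BoundedTier`, `GatewaySaturatedError`, `admit`) are hypothetical; only the behavior — reject when the tier's queue is already at `MaxPendingPerTier` rather than queueing indefinitely — comes from this issue.

```python
class GatewaySaturatedError(Exception):
    """Raised instead of queueing when a tier's pending queue is full."""

class BoundedTier:
    """Admission control for one tier: cap in-flight calls, bound the queue."""

    def __init__(self, cap: int, max_pending: int):
        self.cap = cap
        self.max_pending = max_pending
        self.in_flight = 0
        self.pending = 0

    def admit(self) -> bool:
        """Return True to run now, False to queue; raise when the queue is full."""
        if self.in_flight < self.cap:
            self.in_flight += 1
            return True
        if self.pending >= self.max_pending:
            raise GatewaySaturatedError("gateway saturated")
        self.pending += 1
        return False
```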

Centralized metrics

  • Per call: tier, latency (semaphore wait / in-flight / total), token counts (in / out / cached), cost, retry count, outcome.
  • Per tier: queue depth, in-flight count, slots free, rolling RPM/TPM.
  • Routed through the existing telemetry pipeline.
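
A per-call metrics record carrying the fields listed above might look like this. The record shape and field names are illustrative (the actual schema lives in the design doc); the latency split into semaphore-wait and in-flight components is from this issue.

```python
from dataclasses import dataclass

@dataclass
class LlmCallMetrics:
    tier: str
    wait_ms: float        # time spent waiting for a tier slot
    in_flight_ms: float   # time the provider call was in flight
    tokens_in: int
    tokens_out: int
    tokens_cached: int
    cost_usd: float
    retries: int
    outcome: str          # e.g. "ok" | "cancelled" | "rate_limited" | "error"

    @property
    def total_ms(self) -> float:
        return self.wait_ms + self.in_flight_ms
```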

Out of scope

  • Priority queues / priority lanes. User-priority semantics are achieved via cancellation — see the design doc for why this is sufficient.
  • Cross-process / cross-replica concurrency coordination.
  • Replacing ILlmClient or moving routing logic into the gateway.
  • Streaming-response handling beyond ensuring cancellation works (audit pass should confirm whether any current callers stream).

Acceptance

  • All LLM calls route through the gateway; no call site bypasses it.
  • SDK retry is disabled and confirmed via test.
  • Per-tier caps apply correctly; test with a synthetic load that exceeds cap and confirms serialization at the cap.
  • 429 with Retry-After waits the indicated duration; 429 without Retry-After falls back to exponential backoff.
  • Cancelling ct during a queued wait aborts the call promptly (test per tier).
  • Cancelling ct during an in-flight call aborts the SDK request (test per tier).
  • MaxPendingPerTier exceeded → caller receives "saturated" error rather than blocking indefinitely.
  • Metrics emitted per call as specified in the design doc.
  • Audit pass: every LLM call site flows ct. No CancellationToken.None.

Design reference

design/llm-gateway.md.
