Fail-fast on permanent provider errors (distinguish from transient) #475

@aaronsb

Enhancement — the original intent of the operator-controls branch, deferred with new nuance.

Context

The catalog-404 saga showed an Anthropic 404 can be transient (load-balanced backend, recovers on retry) OR permanent (retired/misconfigured model — e.g. `claude-3-haiku-20240307` selected by `prefer_cheapest`). Today both are handled alike, so a misconfigured model produces many slow, silent failures instead of one fast, loud one — and during ingest it looked like a stall.

To look into

  • Classify provider errors: permanent 4xx config errors (model-not-found / not-entitled) → fail fast, surface clearly, optionally circuit-break the batch after the first; transient (429/529/some 404s) → let retries handle them.
  • A permanent config 404 shouldn't burn per-job retries — it should halt and tell the operator "your model is misconfigured."
  • Relates to the catalog self-healing reconcile issue.
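
A rough sketch of the classification and circuit-break idea from the bullets above, to make the discussion concrete. Everything named here is hypothetical and not taken from the codebase: `ProviderError`, `PermanentProviderError`, `classify`, `run_batch`, and the `confirm_after` threshold. The sketch assumes 401/403 stand in for "not entitled", and that a 404 is only treated as permanent once it has recurred on consecutive calls, since a single 404 can be a flaky backend. How to reliably tell the two kinds of 404 apart is still the open question; the recurrence count is just one possible heuristic. Backoff between retries is omitted for brevity.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


class ProviderError(Exception):
    """Hypothetical wrapper for a failed provider call."""

    def __init__(self, status: int, message: str = ""):
        super().__init__(message)
        self.status = status


class PermanentProviderError(Exception):
    """Raised to halt the whole batch; the message is meant for the operator."""


# Assumption: these status codes mean "not entitled" for the selected model.
PERMANENT_STATUSES = {401, 403}


def classify(status: int, consecutive_404s: int = 0, confirm_after: int = 2) -> ErrorClass:
    """Decide whether an error should fail fast or go through retries."""
    if status in PERMANENT_STATUSES:
        return ErrorClass.PERMANENT
    if status == 404:
        # One 404 may be a flaky backend; repeated 404s are assumed to be config.
        if consecutive_404s >= confirm_after:
            return ErrorClass.PERMANENT
        return ErrorClass.TRANSIENT
    # 429, 529 and anything unrecognised stay on the existing retry path.
    return ErrorClass.TRANSIENT


def run_batch(jobs, call, model, max_retries=3):
    """Run call(job) for each job, halting the batch on a permanent error."""
    results = []
    consecutive_404s = 0
    for job in jobs:
        for attempt in range(max_retries + 1):
            try:
                results.append(call(job))
                consecutive_404s = 0
                break
            except ProviderError as err:
                consecutive_404s = consecutive_404s + 1 if err.status == 404 else 0
                if classify(err.status, consecutive_404s) is ErrorClass.PERMANENT:
                    # Circuit-break: stop here instead of burning per-job retries.
                    raise PermanentProviderError(
                        f"model {model!r} looks misconfigured "
                        f"(HTTP {err.status}); halting batch"
                    ) from err
                if attempt == max_retries:
                    raise  # transient error that never recovered
    return results
```

With this shape, a retired model would stop the batch on the second consecutive 404 with an operator-facing message, while a one-off 404 that recovers on retry would pass through unnoticed. Whether the threshold should be per batch, per model, or shared with the catalog reconcile logic is left open.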
