Enhancement — the original intent of the operator-controls branch, deferred with new nuance.
Context
The catalog-404 saga showed an Anthropic 404 can be transient (load-balanced backend, recovers on retry) OR permanent (retired/misconfigured model — e.g. `claude-3-haiku-20240307` selected by `prefer_cheapest`). Today both are handled alike, so a misconfigured model produces many slow, silent failures instead of one fast, loud one — and during ingest it looked like a stall.
To look into
- Classify provider errors: permanent 4xx config errors (model-not-found / not-entitled) → fail fast, surface clearly, optionally circuit-break the batch after the first; transient (429/529/some 404s) → let retries handle.
- A permanent config 404 shouldn't burn per-job retries — it should halt and tell the operator "your model is misconfigured."
- Relates to the catalog self-healing reconcile issue.
Enhancement — the original intent of the operator-controls branch, deferred with new nuance.
Context
The catalog-404 saga showed an Anthropic 404 can be transient (load-balanced backend, recovers on retry) OR permanent (retired/misconfigured model — e.g. `claude-3-haiku-20240307` selected by `prefer_cheapest`). Today both are handled alike, so a misconfigured model produces many slow, silent failures instead of one fast, loud one — and during ingest it looked like a stall.
To look into