---
description: Validate a model entry (or every model in a provider) in apps/sim/providers/models.ts against the provider's live API docs (no hallucination — reports what cannot be verified)
argument-hint: <provider> [model-id]
---

# Validate Model Skill

You audit one or more model entries in `apps/sim/providers/models.ts` against the provider's official live API docs. **Hallucinated pricing and capabilities are the #1 failure mode in this file.** Every numeric and capability claim must be re-derived from a live web fetch in this session — not from memory, not from training data, not from the user's marketing email.

## Hard rules (do not skip)

1. **Live-fetch or report unverified.** Each field must be backed by a live WebFetch in this session. If you cannot reach an authoritative URL for a field, mark it **UNVERIFIED** in the report — do not silently confirm it from memory.
2. **Cite every fact.** Every value in the report must show the source URL it was checked against. No URL → mark UNVERIFIED.
3. **Two-source rule for pricing.** Cross-check input/output/cached rates against at least one secondary source (OpenRouter, Artificial Analysis, CloudPrice). If sources disagree, the provider's own docs win — flag the disagreement.
4. **Inspect the provider implementation before flagging capability mismatches.** A capability flag in `models.ts` is dead unless the provider's code under `apps/sim/providers/{provider}/` consumes it (see the consumption map in Step 3). Setting a flag the provider ignores is a warning, not a critical.
5. **Never auto-fix without printing the diff.** Show the user the proposed diff before applying it, and get confirmation.

## Your Task

When invoked as `/validate-model <provider> [model-id]`:

1. Read the target entries from `models.ts`
2. Live-fetch the provider's official models, pricing, and capability/reasoning pages, plus at least one secondary source for pricing
3. Inspect the provider implementation to learn which flags are actually consumed
4. Run the checklist below per model
5. Report findings (critical / warning / suggestion / unverified) with every cell linked to its source URL
6. Offer to fix; on confirmation, edit `models.ts` in a single pass and re-lint

If `model-id` is omitted, validate every model in the provider.

## Step 1: Read entries from `models.ts`

Capture per model: `id`, full `pricing`, full `capabilities`, `contextWindow`, `releaseDate`, `recommended`, `speedOptimized`, `deprecated`.
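As a sketch of the shape to capture (field names mirror this skill's checklist; the real types in `apps/sim/providers/models.ts` may differ, and all values here are made up):

```typescript
// Hypothetical shape of a models.ts entry. Field names follow this skill's
// checklist; the actual repo types may differ.
interface ModelPricing {
  input: number;        // USD per 1M input tokens
  output: number;       // USD per 1M output tokens
  cachedInput?: number; // USD per 1M cached input tokens, if caching exists
  updatedAt: string;    // ISO date the prices were last verified
}

interface ModelEntry {
  id: string;
  pricing: ModelPricing;
  capabilities: Record<string, unknown>;
  contextWindow: number;
  releaseDate?: string;
  recommended?: boolean;
  speedOptimized?: boolean;
  deprecated?: boolean;
}

// Illustrative values only. Never trust numbers like these without a live fetch.
const example: ModelEntry = {
  id: "example-model-1",
  pricing: { input: 1.25, output: 10, cachedInput: 0.125, updatedAt: "2026-01-15" },
  capabilities: { temperature: true },
  contextWindow: 200_000,
  releaseDate: "2025-12-01",
  recommended: true,
};
```

Every field in this shape gets its own checklist row in Step 4.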

## Step 2: Live-fetch authoritative sources

Use the canonical provider URL table in `add-model.md` (Step 1) as the single source of truth — fetch the models index, pricing, and reasoning/parameter-caveat pages listed there for the target provider. If you update one table, update the other in the same change.

Secondary cross-check (use at least one): OpenRouter, Artificial Analysis, CloudPrice.

If a fetch fails (404, timeout, paywall), record the URL attempted and mark the dependent fields UNVERIFIED.

## Step 3: Build the consumption map for this provider

Re-grep before trusting the snapshot below:

```bash
rg "reasoningEffort|reasoning_effort" apps/sim/providers/<provider>/
rg "verbosity" apps/sim/providers/<provider>/
rg "request\.thinking|thinking:" apps/sim/providers/<provider>/
rg "supportsNativeStructuredOutputs|nativeStructuredOutputs" apps/sim/providers/<provider>/
```

Snapshot (verify before relying on it):

| Capability | Consumed by |
|---|---|
| `reasoningEffort` | `openai/core.ts`, `azure-openai`, `anthropic/core.ts` (mapped via thinking), `gemini/core.ts` |
| `verbosity` | `openai/core.ts`, `azure-openai/index.ts` |
| `thinking` | `anthropic/core.ts`, `gemini/core.ts` |
| `nativeStructuredOutputs` | `anthropic/core.ts`, `fireworks/index.ts`, `openrouter/index.ts` |
| `computerUse` | `anthropic/core.ts` |
| `temperature` | All providers (passthrough) |

A flag set in `models.ts` but absent from the consumption list for this provider = **warning: dead flag**.
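The dead-flag rule is a set difference. A minimal sketch (the per-provider sets below are illustrative, not authoritative; always rebuild them from the `rg` output above):

```typescript
// Illustrative snapshot of which capability flags each provider's code reads.
// Rebuild from the rg searches above before trusting it.
const consumedFlags: Record<string, Set<string>> = {
  openai: new Set(["temperature", "reasoningEffort", "verbosity"]),
  anthropic: new Set([
    "temperature", "reasoningEffort", "thinking",
    "nativeStructuredOutputs", "computerUse",
  ]),
};

// Any capability set in models.ts that the provider's code never reads
// is dead and should be reported as a warning.
function deadFlags(provider: string, capabilities: string[]): string[] {
  const consumed = consumedFlags[provider] ?? new Set<string>();
  return capabilities.filter((flag) => !consumed.has(flag));
}
```

For example, `deadFlags("openai", ["temperature", "thinking"])` would report `thinking` as dead, since `openai/core.ts` never reads it.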

## Step 4: Run the checklist

For each model, evaluate every row. Statuses: ✓ matches docs, ✗ disagrees, ⚠️ single-source, ❓ UNVERIFIED (could not fetch).

### Identity
- [ ] `id` exactly matches the provider's API model identifier (case, dots, dashes, prefix for resellers)
- [ ] `releaseDate` matches the launch announcement
- [ ] `deprecated: true` is set if the provider has announced retirement (or removed the model from its active list)

### Pricing (per 1M tokens, USD)
- [ ] `pricing.input` matches the provider pricing page
- [ ] `pricing.output` matches the provider pricing page
- [ ] `pricing.cachedInput` matches the provider's documented cached/prompt-cache rate (or is correctly omitted if no caching is offered)
- [ ] `pricing.updatedAt` is recent — warn if older than 60 days
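The 60-day staleness rule is a plain date comparison. A sketch, assuming `updatedAt` is an ISO date string ("now" is injected so the check is deterministic):

```typescript
// Warn when pricing.updatedAt is older than maxAgeDays relative to "now".
// Both arguments are ISO date strings.
function pricingIsStale(updatedAt: string, now: string, maxAgeDays = 60): boolean {
  const ageMs = new Date(now).getTime() - new Date(updatedAt).getTime();
  return ageMs > maxAgeDays * 24 * 60 * 60 * 1000;
}
```

So pricing last verified on 2026-01-01 is stale by 2026-04-01 (90 days), while a mid-March check is not.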

### Context & output limits
- [ ] `contextWindow` matches docs (in tokens)
- [ ] `capabilities.maxOutputTokens` matches the documented output cap (or is correctly omitted if the provider documents no output limit)

### Capabilities (each must be DOCUMENTED-AS-SUPPORTED **and** CONSUMED-BY-PROVIDER-CODE)
- [ ] `temperature` — the provider accepts it for this model (reasoning-always-on models often reject it)
- [ ] `reasoningEffort.values` — list matches docs; **omitted** for always-reasoning models that reject the parameter (e.g., grok-4.3, where xAI docs explicitly state `reasoning_effort` is not supported). Verify per model — some always-reasoning models (e.g., OpenAI's o-series) DO accept `reasoning_effort` and should keep the flag.
- [ ] `verbosity.values` — only on the OpenAI gpt-5.x family; values match docs
- [ ] `thinking.levels` + `thinking.default` — only on Anthropic/Gemini; values match docs
- [ ] `nativeStructuredOutputs` — only on anthropic/fireworks/openrouter; the provider must document Structured Outputs / JSON mode for this model
- [ ] `toolUsageControl` — the provider supports `tool_choice` semantics
- [ ] `computerUse` — the provider implements a computer-use loop AND the model is a computer-use SKU
- [ ] `deepResearch` — only on actual deep-research SKUs
- [ ] `memory: false` — only when the model genuinely cannot maintain conversation history

### Flags
- [ ] `recommended: true` — at most one or two per provider; should be the current flagship
- [ ] `speedOptimized: true` — only on the smallest/fastest tier (nano / flash-lite / haiku class)

## Step 5: Report (mandatory format)

For each model, emit a table with one row per checklist item. Every row that claims ✓ must have a URL.

```markdown
### Validation — <model-id>

| Field | Repo | Live docs | Source URL | Status |
|---|---|---|---|---|
| `input` | $1.25/M | $1.25/M | https://docs.x.ai/... | ✓ |
| `cachedInput` | $0.50/M | $0.20/M | https://cloudprice.net/... | ✗ stale (price cut not picked up) |
| `reasoningEffort` | low/medium/high | rejected by API | https://docs.x.ai/.../reasoning | ✗ inert — selecting silently no-ops |
| `contextWindow` | 1,000,000 | 1,000,000 | https://docs.x.ai/... + https://openrouter.ai/... | ✓ (2 sources) |
| `releaseDate` | 2026-04-30 | not found in scraped pages | _attempted: docs.x.ai, x.ai/news_ | ❓ UNVERIFIED |

**Findings**
- 🔴 critical — `cachedInput` is wrong: docs say $0.20/M, repo has $0.50/M
- 🟡 warning — `reasoningEffort` is set but the provider rejects it for this model (xAI docs explicitly: "reasoning_effort is not supported by grok-4.3")
- 🔵 suggestion — `pricing.updatedAt` is 90 days old; refresh it
- ❓ unverified — `releaseDate` could not be confirmed from any fetched page; ask the user

**Disagreements between sources**
- _none_ OR _OpenRouter says $X, provider docs say $Y — went with provider docs_
```

End each multi-model run with a summary count: `N models checked · X critical · Y warnings · Z suggestions · W unverified`.
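The summary line can be tallied mechanically from the per-model findings. A sketch (severity names match the report format above):

```typescript
type Severity = "critical" | "warning" | "suggestion" | "unverified";

// Tally findings across all checked models into the required one-line summary.
function summaryLine(modelsChecked: number, findings: Severity[]): string {
  const count = (s: Severity) => findings.filter((f) => f === s).length;
  return `${modelsChecked} models checked · ${count("critical")} critical · ` +
    `${count("warning")} warnings · ${count("suggestion")} suggestions · ` +
    `${count("unverified")} unverified`;
}
```

For example, two models with one critical and two warnings yields `2 models checked · 1 critical · 2 warnings · 0 suggestions · 0 unverified`.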

## Step 6: Offer to fix

After reporting, ask: *"Want me to fix the critical and warning items? I'll print the diff first."* On yes:

1. Print the proposed diff (do not apply it yet)
2. Get user confirmation
3. Edit `models.ts` in a single pass
4. Run `bun run lint`
5. Re-run only the failed rows of the checklist against the new state

## Severity definitions

- 🔴 **critical** — a wrong number or identifier that misleads users about cost or breaks API calls. Examples: incorrect pricing, wrong model id, wrong context window, a capability the API rejects.
- 🟡 **warning** — dead code or an internal inconsistency. Examples: a capability flag the provider ignores, multiple `recommended: true` per provider, `pricing.updatedAt` more than 60 days old, missing `deprecated: true` on a retired model.
- 🔵 **suggestion** — style/consistency. Examples: field order, missing `speedOptimized` on a clearly smallest-tier model.
- ❓ **unverified** — could not fetch an authoritative source for this field. Surface it; never silently confirm.

## Common bugs this skill catches

- Pricing drift after a provider price cut (very common — providers cut quarterly)
- `reasoningEffort` set on always-reasoning models that reject the parameter (grok-4.3, o3-pro pattern)
- `nativeStructuredOutputs` set on providers that don't consume the flag (dead)
- `thinking` set on non-Anthropic/non-Gemini providers
- `verbosity` set on non-gpt-5.x models
- Wrong context window (e.g., 128k claimed vs 200k actual)
- Stale `pricing.updatedAt`
- Multiple `recommended: true` per provider after a flagship swap
- Missing `deprecated: true` on retired models (e.g., the xAI batch retiring May 15, 2026)

## What "I cannot verify this" looks like

If, after fetching the documented sources, a field cannot be confirmed:

- Mark the row ❓ UNVERIFIED with the URL(s) attempted
- Surface it in the **Findings** section with severity ❓
- Do NOT mark the validation as passed
- Ask the user for a docs URL or guidance before changing anything

The skill is allowed to say *"I could not verify the cached input price for grok-4.3 from the official xAI docs in this session — I attempted [URLs] without finding the value. Third-party sources [URL1, URL2] both report $0.20/M. Confirm before I update."* That is correct behavior. Hallucinating a number is not.