---
description: Validate a model entry (or every model in a provider) in apps/sim/providers/models.ts against the provider's live API docs (no hallucination — reports what cannot be verified)
argument-hint: <provider> [model-id]
---

# Validate Model Skill

You audit one or more model entries in `apps/sim/providers/models.ts` against the provider's official live API docs. **Hallucinated pricing and capabilities are the #1 failure mode in this file.** Every numeric and capability claim must be re-derived from a live web fetch in this session — not from memory, not from training data, not from the user's marketing email.

## Hard rules (do not skip)

1. **Live-fetch or report unverified.** Each field must be backed by a live WebFetch in this session. If you cannot reach an authoritative URL for a field, mark it **UNVERIFIED** in the report — do not silently confirm it from memory.
2. **Cite every fact.** Every value in the report must show the source URL it was checked against. No URL → mark UNVERIFIED.
3. **Two-source rule for pricing.** Cross-check input/output/cached rates against at least one secondary source (OpenRouter, Artificial Analysis, CloudPrice). If sources disagree, the provider's own docs win — flag the disagreement.
4. **Inspect the provider implementation before flagging capability mismatches.** A capability flag in `models.ts` is dead unless the provider's code under `apps/sim/providers/{provider}/` consumes it (see the consumption map in Step 3). Setting a flag the provider ignores is a warning, not a critical.
5. **Never auto-fix without printing the diff.** Show the user the proposed diff before applying it, and get confirmation.

## Your Task

When invoked as `/validate-model <provider> [model-id]`:

1. Read the target entries from `models.ts`
2. Live-fetch the provider's official models, pricing, and capability/reasoning pages, plus at least one secondary source for pricing
3. Inspect the provider implementation to learn which flags are actually consumed
4. Run the checklist below per model
5. Report findings (critical / warning / suggestion / unverified) with every cell linked to its source URL
6. Offer to fix; on confirmation, edit `models.ts` in a single pass and re-lint

If `model-id` is omitted, validate every model in the provider.

## Step 1: Read entries from `models.ts`

Capture per model: `id`, full `pricing`, full `capabilities`, `contextWindow`, `releaseDate`, `recommended`, `speedOptimized`, `deprecated`.
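As a sketch of the shape to capture (field names mirror this skill's checklist; the real types in `apps/sim/providers/models.ts` may differ, and all values here are made up):

```typescript
// Hypothetical shape of a models.ts entry. Field names follow this skill's
// checklist; the actual repo types may differ.
interface ModelPricing {
  input: number;        // USD per 1M input tokens
  output: number;       // USD per 1M output tokens
  cachedInput?: number; // USD per 1M cached input tokens, if caching exists
  updatedAt: string;    // ISO date the prices were last verified
}

interface ModelEntry {
  id: string;
  pricing: ModelPricing;
  capabilities: Record<string, unknown>;
  contextWindow: number;
  releaseDate?: string;
  recommended?: boolean;
  speedOptimized?: boolean;
  deprecated?: boolean;
}

// Illustrative values only. Never trust numbers like these without a live fetch.
const example: ModelEntry = {
  id: "example-model-1",
  pricing: { input: 1.25, output: 10, cachedInput: 0.125, updatedAt: "2026-01-15" },
  capabilities: { temperature: true },
  contextWindow: 200_000,
  releaseDate: "2025-12-01",
  recommended: true,
};
```

Every field in this shape gets its own checklist row in Step 4.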

## Step 2: Live-fetch authoritative sources

Use the canonical provider URL table in `add-model.md` (Step 1) as the single source of truth — fetch the models index, pricing, and reasoning/parameter-caveat pages listed there for the target provider. If you update one table, update the other in the same change.

Secondary cross-check (use at least one): OpenRouter, Artificial Analysis, CloudPrice.

If a fetch fails (404, timeout, paywall), record the URL attempted and mark the dependent fields UNVERIFIED.

## Step 3: Build the consumption map for this provider

Re-grep before trusting the snapshot below:

```bash
rg "reasoningEffort|reasoning_effort" apps/sim/providers/<provider>/
rg "verbosity" apps/sim/providers/<provider>/
rg "request\.thinking|thinking:" apps/sim/providers/<provider>/
rg "supportsNativeStructuredOutputs|nativeStructuredOutputs" apps/sim/providers/<provider>/
```

Snapshot (verify before relying on it):

| Capability | Consumed by |
|---|---|
| `reasoningEffort` | `openai/core.ts`, `azure-openai`, `anthropic/core.ts` (mapped via thinking), `gemini/core.ts` |
| `verbosity` | `openai/core.ts`, `azure-openai/index.ts` |
| `thinking` | `anthropic/core.ts`, `gemini/core.ts` |
| `nativeStructuredOutputs` | `anthropic/core.ts`, `fireworks/index.ts`, `openrouter/index.ts` |
| `computerUse` | `anthropic/core.ts` |
| `temperature` | All providers (passthrough) |

A flag set in `models.ts` but absent from the consumption list for this provider = **warning: dead flag**.
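The dead-flag rule is a set difference. A minimal sketch (the per-provider sets below are illustrative, not authoritative; always rebuild them from the `rg` output above):

```typescript
// Illustrative snapshot of which capability flags each provider's code reads.
// Rebuild from the rg searches above before trusting it.
const consumedFlags: Record<string, Set<string>> = {
  openai: new Set(["temperature", "reasoningEffort", "verbosity"]),
  anthropic: new Set([
    "temperature", "reasoningEffort", "thinking",
    "nativeStructuredOutputs", "computerUse",
  ]),
};

// Any capability set in models.ts that the provider's code never reads
// is dead and should be reported as a warning.
function deadFlags(provider: string, capabilities: string[]): string[] {
  const consumed = consumedFlags[provider] ?? new Set<string>();
  return capabilities.filter((flag) => !consumed.has(flag));
}
```

For example, `deadFlags("openai", ["temperature", "thinking"])` would report `thinking` as dead, since `openai/core.ts` never reads it.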

## Step 4: Run the checklist

For each model, evaluate every row. Statuses: ✓ matches docs, ✗ disagrees, ⚠️ single-source, ❓ UNVERIFIED (could not fetch).

### Identity
- [ ] `id` exactly matches the provider's API model identifier (case, dots, dashes, prefix for resellers)
- [ ] `releaseDate` matches the launch announcement
- [ ] `deprecated: true` is set if the provider has announced retirement (or removed the model from its active list)

### Pricing (per 1M tokens, USD)
- [ ] `pricing.input` matches the provider pricing page
- [ ] `pricing.output` matches the provider pricing page
- [ ] `pricing.cachedInput` matches the provider's documented cached/prompt-cache rate (or is correctly omitted if no caching is offered)
- [ ] `pricing.updatedAt` is recent — warn if older than 60 days
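The 60-day staleness rule is a plain date comparison. A sketch, assuming `updatedAt` is an ISO date string ("now" is injected so the check is deterministic):

```typescript
// Warn when pricing.updatedAt is older than maxAgeDays relative to "now".
// Both arguments are ISO date strings.
function pricingIsStale(updatedAt: string, now: string, maxAgeDays = 60): boolean {
  const ageMs = new Date(now).getTime() - new Date(updatedAt).getTime();
  return ageMs > maxAgeDays * 24 * 60 * 60 * 1000;
}
```

So pricing last verified on 2026-01-01 is stale by 2026-04-01 (90 days), while a mid-March check is not.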

### Context & output limits
- [ ] `contextWindow` matches docs (in tokens)
- [ ] `capabilities.maxOutputTokens` matches the documented output cap (or is correctly omitted if the provider documents no output limit)

### Capabilities (each must be DOCUMENTED-AS-SUPPORTED **and** CONSUMED-BY-PROVIDER-CODE)
- [ ] `temperature` — the provider accepts it for this model (reasoning-always-on models often reject it)
- [ ] `reasoningEffort.values` — list matches docs; **omitted** for always-reasoning models that reject the parameter (e.g., grok-4.3, where xAI docs explicitly state `reasoning_effort` is not supported). Verify per model — some always-reasoning models (e.g., OpenAI's o-series) DO accept `reasoning_effort` and should keep the flag.
- [ ] `verbosity.values` — only on the OpenAI gpt-5.x family; values match docs
- [ ] `thinking.levels` + `thinking.default` — only on Anthropic/Gemini; values match docs
- [ ] `nativeStructuredOutputs` — only on anthropic/fireworks/openrouter; the provider must document Structured Outputs / JSON mode for this model
- [ ] `toolUsageControl` — the provider supports `tool_choice` semantics
- [ ] `computerUse` — the provider implements a computer-use loop AND the model is a computer-use SKU
- [ ] `deepResearch` — only on actual deep-research SKUs
- [ ] `memory: false` — only when the model genuinely cannot maintain conversation history

### Flags
- [ ] `recommended: true` — at most one or two per provider; should be the current flagship
- [ ] `speedOptimized: true` — only on the smallest/fastest tier (nano / flash-lite / haiku class)

## Step 5: Report (mandatory format)

For each model, emit a table with one row per checklist item. Every row that claims ✓ must have a URL.

```markdown
### Validation — <model-id>

| Field | Repo | Live docs | Source URL | Status |
|---|---|---|---|---|
| `input` | $1.25/M | $1.25/M | https://docs.x.ai/... | ✓ |
| `cachedInput` | $0.50/M | $0.20/M | https://cloudprice.net/... | ✗ stale (price cut not picked up) |
| `reasoningEffort` | low/medium/high | rejected by API | https://docs.x.ai/.../reasoning | ✗ inert — selecting silently no-ops |
| `contextWindow` | 1,000,000 | 1,000,000 | https://docs.x.ai/... + https://openrouter.ai/... | ✓ (2 sources) |
| `releaseDate` | 2026-04-30 | not found in scraped pages | _attempted: docs.x.ai, x.ai/news_ | ❓ UNVERIFIED |

**Findings**
- 🔴 critical — `cachedInput` is wrong: docs say $0.20/M, repo has $0.50/M
- 🟡 warning — `reasoningEffort` is set but the provider rejects it for this model (xAI docs explicitly: "reasoning_effort is not supported by grok-4.3")
- 🔵 suggestion — `pricing.updatedAt` is 90 days old; refresh it
- ❓ unverified — `releaseDate` could not be confirmed from any fetched page; ask the user

**Disagreements between sources**
- _none_ OR _OpenRouter says $X, provider docs say $Y — went with provider docs_
```

End each multi-model run with a summary count: `N models checked · X critical · Y warnings · Z suggestions · W unverified`.
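The summary line can be tallied mechanically from the per-model findings. A sketch (severity names match the report format above):

```typescript
type Severity = "critical" | "warning" | "suggestion" | "unverified";

// Tally findings across all checked models into the required one-line summary.
function summaryLine(modelsChecked: number, findings: Severity[]): string {
  const count = (s: Severity) => findings.filter((f) => f === s).length;
  return `${modelsChecked} models checked · ${count("critical")} critical · ` +
    `${count("warning")} warnings · ${count("suggestion")} suggestions · ` +
    `${count("unverified")} unverified`;
}
```

For example, two models with one critical and two warnings yields `2 models checked · 1 critical · 2 warnings · 0 suggestions · 0 unverified`.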

## Step 6: Offer to fix

After reporting, ask: *"Want me to fix the critical and warning items? I'll print the diff first."* On yes:

1. Print the proposed diff (do not apply it yet)
2. Get user confirmation
3. Edit `models.ts` in a single pass
4. Run `bun run lint`
5. Re-run only the failed rows of the checklist against the new state

## Severity definitions

- 🔴 **critical** — a wrong number or identifier that misleads users about cost or breaks API calls. Examples: incorrect pricing, wrong model id, wrong context window, a capability the API rejects.
- 🟡 **warning** — dead code or an internal inconsistency. Examples: a capability flag the provider ignores, multiple `recommended: true` per provider, `pricing.updatedAt` more than 60 days old, missing `deprecated: true` on a retired model.
- 🔵 **suggestion** — style/consistency. Examples: field order, missing `speedOptimized` on a clearly smallest-tier model.
- ❓ **unverified** — could not fetch an authoritative source for this field. Surface it; never silently confirm.

## Common bugs this skill catches

- Pricing drift after a provider price cut (very common — providers cut quarterly)
- `reasoningEffort` set on always-reasoning models that reject the parameter (grok-4.3, o3-pro pattern)
- `nativeStructuredOutputs` set on providers that don't consume the flag (dead)
- `thinking` set on non-Anthropic/non-Gemini providers
- `verbosity` set on non-gpt-5.x models
- Wrong context window (e.g., 128k claimed vs 200k actual)
- Stale `pricing.updatedAt`
- Multiple `recommended: true` per provider after a flagship swap
- Missing `deprecated: true` on retired models (e.g., the xAI batch retiring May 15, 2026)

## What "I cannot verify this" looks like

If, after fetching the documented sources, a field cannot be confirmed:

- Mark the row ❓ UNVERIFIED with the URL(s) attempted
- Surface it in the **Findings** section with severity ❓
- Do NOT mark the validation as passed
- Ask the user for a docs URL or guidance before changing anything

The skill is allowed to say *"I could not verify the cached input price for grok-4.3 from the official xAI docs in this session — I attempted [URLs] without finding the value. Third-party sources [URL1, URL2] both report $0.20/M. Confirm before I update."* That is correct behavior. Hallucinating a number is not.