Commit ad88859 — chore(skills): add /add-model and /validate-model commands (#4475)

## .claude/commands/add-model.md
---
description: Add a new LLM model to apps/sim/providers/models.ts with specs verified against the provider's live API docs (no hallucination)
argument-hint: <provider> <model-id> [docs-url]
---

# Add Model Skill

You add a new model entry to `apps/sim/providers/models.ts`. **Every numeric and capability claim MUST be derived from a live web fetch of the provider's official docs in this session.** Marketing emails, training data, and your prior knowledge are not sources of truth — they routinely hallucinate pricing, context windows, and capability lists.

## Hard rules (do not skip)

1. **Live-fetch or refuse.** Before writing the entry, you must successfully WebFetch the provider's official models/pricing page in this session. If you cannot reach an authoritative source for any field, **mark the field as UNVERIFIED in your report and ask the user before guessing**. Never fill in pricing or capabilities from memory.
2. **Two-source rule for pricing.** Cross-check input/output/cached pricing against at least one secondary source (OpenRouter, Artificial Analysis, CloudPrice, mem0, intuitionlabs). If sources disagree, the provider's own docs win — but flag the disagreement.
3. **Read the code before setting capability flags.** Capability flags are dead unless the provider's implementation under `apps/sim/providers/{provider}/` actually consumes them (see Consumption Matrix below). Setting a flag the provider ignores is a silent bug.
4. **Cite every fact.** Your final report must list the URL each value came from. No URL → not verified.
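The two-source rule in rule 2 can be sketched as a small comparison helper. This is an illustration only; `crossCheck` and `PriceCheck` are hypothetical names, not part of the repo:

```ts
// Hypothetical helper illustrating the two-source pricing rule.
// Names and shapes are illustrative, not actual repo code.
interface PriceCheck {
  status: "verified" | "disagreement";
  value: number; // the value to write (provider docs always win)
  note?: string;
}

function crossCheck(providerDocs: number, secondary: number): PriceCheck {
  if (providerDocs === secondary) {
    return { status: "verified", value: providerDocs };
  }
  // Sources disagree: still use the provider's own number, but flag it.
  return {
    status: "disagreement",
    value: providerDocs,
    note: `secondary says ${secondary}, provider docs say ${providerDocs}; used provider docs`,
  };
}
```

The key property is that a disagreement never changes which value is written, only whether a flag reaches the report.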
## Your Task

1. Identify provider and model id from user args
2. Live-fetch official docs + pricing page + capability/parameter pages + at least one secondary source
3. Apply the Consumption Matrix to know which capability flags are real
4. Read 2-3 sibling entries in `models.ts` and match their pattern exactly
5. Insert the entry, run `bun run lint`, print the verification report

## Step 1: Live source-of-truth lookup

In priority order — fetch all that exist for the provider:

| Provider | Models index | Pricing | Reasoning/parameter caveats |
|---|---|---|---|
| OpenAI | platform.openai.com/docs/models | openai.com/api/pricing | platform.openai.com/docs/guides/reasoning |
| Anthropic | docs.anthropic.com/en/docs/about-claude/models | anthropic.com/pricing | docs.anthropic.com/en/docs/build-with-claude/extended-thinking |
| Google (Gemini) | ai.google.dev/gemini-api/docs/models | ai.google.dev/pricing | ai.google.dev/gemini-api/docs/thinking |
| xAI | docs.x.ai/developers/models | docs.x.ai/developers/models (per-model detail page) | docs.x.ai/developers/model-capabilities/text/reasoning |
| Mistral | docs.mistral.ai/getting-started/models/models_overview | mistral.ai/pricing | n/a |
| DeepSeek | api-docs.deepseek.com/quick_start/pricing | same | api-docs.deepseek.com/guides/reasoning_model |
| Groq | console.groq.com/docs/models | groq.com/pricing | n/a |
| Cerebras | inference-docs.cerebras.ai/models | cerebras.ai/pricing | n/a |

Secondary verification (use at least one): `openrouter.ai/<provider>/<model>`, `artificialanalysis.ai/models/<model>`, `cloudprice.net/models/<provider>-<model>`.

Use a precise WebFetch prompt: *"Extract for {model_id}: exact model id string, context window in tokens, input price per 1M, cached input price per 1M, output price per 1M, max output tokens, supported reasoning effort levels, accepted parameters (temperature, top_p), release date. Do not fill in fields you cannot find."*
## Step 2: Consumption Matrix (which provider honors which capability)

| Capability | Honored by | Effect if set elsewhere |
|---|---|---|
| `temperature` | All providers (passed through if set) | Safe but inert on always-reasoning models that reject it |
| `toolUsageControl` | All providers (provider-level, not per-model) | n/a — set on `ProviderDefinition`, not models |
| `reasoningEffort` | `openai/core.ts`, `azure-openai`, `anthropic/core.ts` (mapped to thinking), `gemini/core.ts` | **Dead on xai, deepseek, mistral, groq, cerebras, openrouter, fireworks, bedrock, vertex** unless their core consumes it — re-grep before assuming |
| `verbosity` | `openai/core.ts`, `azure-openai/index.ts` only | Dead elsewhere |
| `thinking` | `anthropic/core.ts`, `gemini/core.ts` | Dead elsewhere |
| `nativeStructuredOutputs` | `anthropic/core.ts`, `fireworks/index.ts`, `openrouter/index.ts` | Dead on openai, xai, google, vertex, bedrock, azure-openai, deepseek, mistral, groq, cerebras |
| `maxOutputTokens` | Read by UI + executor for token estimation | Always meaningful — set if provider documents a cap |
| `computerUse` | `anthropic/core.ts` | Dead elsewhere |
| `deepResearch` | UI flag for routing to deep-research SKUs | Set only on actual deep-research model IDs |
| `memory: false` | Conversation persistence opt-out | Set only when model genuinely cannot maintain history (e.g., deep-research) |

**Always re-grep before relying on this table** — the codebase moves:

```bash
rg "reasoningEffort|reasoning_effort" apps/sim/providers/<provider>/
rg "verbosity" apps/sim/providers/<provider>/
rg "request\.thinking|thinking:" apps/sim/providers/<provider>/
rg "supportsNativeStructuredOutputs|nativeStructuredOutputs" apps/sim/providers/<provider>/
```
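What the greps above are looking for is a consumption site in the provider's request builder. The sketch below is NOT the actual `apps/sim/providers` code, just the schematic shape of a consumed flag:

```ts
// Schematic of a "consumed" capability flag. If no line like the
// conditional below exists anywhere under apps/sim/providers/<provider>/,
// setting reasoningEffort in models.ts is a dead flag for that provider.
interface ProviderRequest {
  reasoningEffort?: string;
}

function buildPayload(request: ProviderRequest): Record<string, unknown> {
  const payload: Record<string, unknown> = {};
  if (request.reasoningEffort !== undefined) {
    // This line is the "consumption" the grep detects.
    payload.reasoning_effort = request.reasoningEffort;
  }
  return payload;
}
```

A flag with no such consumption site never reaches the API call, which is why setting it is a silent bug rather than a loud error.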
## Step 3: Match the provider's existing entry pattern

Open `apps/sim/providers/models.ts`, find `PROVIDER_DEFINITIONS[<provider>].models`, read 2-3 sibling entries. Match field order exactly:

```ts
{
  id: '<exact-api-id>',
  pricing: {
    input: <number>,
    cachedInput: <number>, // omit if provider doesn't offer caching
    output: <number>,
    updatedAt: '<today YYYY-MM-DD>',
  },
  capabilities: {
    // only flags the provider actually consumes — see matrix
  },
  contextWindow: <tokens>,
  releaseDate: '<YYYY-MM-DD>',
  recommended: true, // only if new flagship; ask user before swapping
  speedOptimized: true, // only on smallest/fastest tier
  deprecated: true, // only on retired models
}
```
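For concreteness, a completed entry might look like the following, filled with the grok-4.3 values used in the Step 5 report example (the `updatedAt` date is a placeholder for "today"; every number must still come from a live fetch in your session):

```ts
// Illustrative filled-in entry, using the grok-4.3 figures from the
// Step 5 verification example. Do not copy values; re-verify live.
const exampleEntry = {
  id: 'grok-4.3',
  pricing: {
    input: 1.25,
    cachedInput: 0.2,
    output: 2.5,
    updatedAt: '2026-05-01', // placeholder: always today's date, never a sibling's
  },
  capabilities: {
    temperature: { min: 0, max: 1 },
    // reasoningEffort deliberately omitted: the API rejects it for this model
  },
  contextWindow: 1000000,
  releaseDate: '2026-04-30',
};
```

Note the omission of `reasoningEffort` is itself a verified decision, recorded in the report, not a gap.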
### Reseller providers (azure-openai, azure-anthropic, vertex, bedrock, openrouter)

Model id MUST be prefixed: `azure/`, `azure-anthropic/`, `vertex/`, `bedrock/`, `openrouter/`. Pricing usually mirrors the upstream provider but verify on the reseller's own pricing page.

### Insertion order

Within a family, newest first (matches existing convention: GPT-5.5 above GPT-5.4 above GPT-5.2). Across families, biggest/flagship at top of list.

### `recommended` / `speedOptimized`

- At most one or two `recommended: true` per provider — the current flagship(s).
- If you're adding a new flagship, ask the user before removing `recommended` from the previous flagship. Never silently flip it.
- `speedOptimized: true` only on the smallest/fastest tier (nano, flash-lite, haiku class).
## Step 4: Write, lint

```bash
bun run lint
```

Lint must pass before reporting done. **If lint fails:** read the error, fix the syntax/typing issue in the entry you just wrote (do not delete the entry — it's the work product), re-run lint, and note the fix in a "Lint adjustments" line in the verification report. Never report done with lint failing.
## Step 5: Verification report (mandatory format)

End with this exact structure:

```markdown
### Verification — <model-id>

| Field | Value | Source URL | Status |
|---|---|---|---|
| `id` | `grok-4.3` | https://docs.x.ai/... | ✓ verified |
| `contextWindow` | 1,000,000 | https://docs.x.ai/... + https://openrouter.ai/... | ✓ verified (2 sources agree) |
| `input` | $1.25/M | https://docs.x.ai/... | ✓ verified |
| `cachedInput` | $0.20/M | https://cloudprice.net/... | ⚠️ single source |
| `output` | $2.50/M | https://docs.x.ai/... + https://openrouter.ai/... | ✓ verified |
| `capabilities.temperature` | `{ min: 0, max: 1 }` | matches sibling entries | — pattern-match only |
| `capabilities.reasoningEffort` | NOT SET | provider docs say API rejects it for this model | ✓ correctly omitted |
| `releaseDate` | 2026-04-30 | https://docs.x.ai/... announcement | ✓ verified |

**Disagreements**
- _none_ OR _OpenRouter says X, provider docs say Y — used Y per provider rule_

**Unverified fields**
- _none_ OR _<field>: could not find authoritative source — left as <X> based on sibling pattern; please confirm_
```

If any row is ⚠️ single-source or "unverified," **state it plainly to the user and ask whether to proceed**. Do not silently merge.
## What to do if you cannot find a source

Omitting a field is **not the same as verifying it**. Any field you cannot confirm from a live fetch must be **both** omitted from the entry **and** listed as ❓ UNVERIFIED in the report's "Unverified fields" section, with the URLs you attempted. Then ask the user to confirm before merging.

- Pricing missing → do NOT guess. Omit `cachedInput`. Mark ❓ UNVERIFIED. Ask the user for the price or the docs URL.
- Context window missing → do NOT guess. Ask the user; mark ❓ UNVERIFIED.
- Release date missing → omit the field; mark ❓ UNVERIFIED in the report.
- Capability uncertain → omit the flag (safer than setting a dead/wrong one); mark ❓ UNVERIFIED so the user knows you didn't confirm it either way.
## Anti-patterns this skill exists to prevent

- ❌ Trusting a marketing email (xAI's grok-4.3 email claimed "3 reasoning efforts" but the API rejects `reasoning_effort` — verified by official docs only)
- ❌ Setting `nativeStructuredOutputs: true` on xai/openai/google (dead — only anthropic/fireworks/openrouter consume it)
- ❌ Setting `thinking` on non-Anthropic/non-Gemini providers
- ❌ Setting `verbosity` on anything other than OpenAI gpt-5.x
- ❌ Copying `pricing.updatedAt` from a sibling instead of using today's date
- ❌ Inventing a `cachedInput` price by dividing input by 4 (varies by provider — find an explicit number)
- ❌ Stamping `recommended: true` on the new model without removing it from the previous flagship
- ❌ Reporting "done" with any UNVERIFIED row in the table

## .claude/commands/validate-model.md
---
description: Validate a model entry (or every model in a provider) in apps/sim/providers/models.ts against the provider's live API docs (no hallucination — reports what cannot be verified)
argument-hint: <provider> [model-id]
---

# Validate Model Skill

You audit one or more model entries in `apps/sim/providers/models.ts` against the provider's official live API docs. **Hallucinated pricing and capabilities are the #1 failure mode in this file.** Every numeric and capability claim must be re-derived from a live web fetch in this session — not from memory, not from training data, not from the user's marketing email.

## Hard rules (do not skip)

1. **Live-fetch or report unverified.** Each field must be backed by a live WebFetch in this session. If you cannot reach an authoritative URL for a field, mark it **UNVERIFIED** in the report — do not silently confirm it from memory.
2. **Cite every fact.** Every value in the report must show the source URL it was checked against. No URL → mark UNVERIFIED.
3. **Two-source rule for pricing.** Cross-check input/output/cached against at least one secondary source (OpenRouter, Artificial Analysis, CloudPrice). If sources disagree, the provider's own docs win — flag the disagreement.
4. **Inspect provider implementation before flagging capability mismatches.** A capability flag in `models.ts` is dead unless the provider's code under `apps/sim/providers/{provider}/` consumes it (see Consumption Matrix below). Setting a flag the provider ignores is a warning, not a critical.
5. **Never auto-fix without printing the diff.** Show the user the proposed diff before applying. Get confirmation.
## Your Task

When invoked as `/validate-model <provider> [model-id]`:

1. Read the target entries from `models.ts`
2. Live-fetch the provider's official models, pricing, and capability/reasoning pages + at least one secondary source for pricing
3. Inspect the provider implementation to know which flags are actually consumed
4. Run the checklist below per model
5. Report findings (critical / warning / suggestion / unverified) with every cell linked to its source URL
6. Offer to fix; on confirm, edit `models.ts` in a single pass and re-lint

If `model-id` is omitted, validate every model in the provider.
## Step 1: Read entries from `models.ts`

Capture per model: `id`, full `pricing`, full `capabilities`, `contextWindow`, `releaseDate`, `recommended`, `speedOptimized`, `deprecated`.

## Step 2: Live-fetch authoritative sources

Use the canonical provider URL table in `add-model.md` (Step 1) as the single source of truth — fetch the models index, pricing, and reasoning/parameter caveats pages listed there for the target provider. If you update one table, update the other in the same change.

Secondary cross-check (use at least one): OpenRouter, Artificial Analysis, CloudPrice.

If a fetch fails (404, timeout, paywall), record the URL attempted and mark dependent fields UNVERIFIED.
## Step 3: Build the consumption map for this provider

Re-grep before trusting the snapshot below:

```bash
rg "reasoningEffort|reasoning_effort" apps/sim/providers/<provider>/
rg "verbosity" apps/sim/providers/<provider>/
rg "request\.thinking|thinking:" apps/sim/providers/<provider>/
rg "supportsNativeStructuredOutputs|nativeStructuredOutputs" apps/sim/providers/<provider>/
```

Snapshot (verify before relying):

| Capability | Consumed by |
|---|---|
| `reasoningEffort` | `openai/core.ts`, `azure-openai`, `anthropic/core.ts` (mapped via thinking), `gemini/core.ts` |
| `verbosity` | `openai/core.ts`, `azure-openai/index.ts` |
| `thinking` | `anthropic/core.ts`, `gemini/core.ts` |
| `nativeStructuredOutputs` | `anthropic/core.ts`, `fireworks/index.ts`, `openrouter/index.ts` |
| `computerUse` | `anthropic/core.ts` |
| `temperature` | All providers (passthrough) |

A flag set in `models.ts` but not in the consumption list for this provider = **warning: dead flag**.
## Step 4: Run the checklist

For each model, evaluate every row. Statuses: ✓ matches docs, ✗ disagrees, ⚠️ single-source, ❓ UNVERIFIED (could not fetch).

### Identity
- [ ] `id` exactly matches provider's API model identifier (case, dots, dashes, prefix for resellers)
- [ ] `releaseDate` matches launch announcement
- [ ] `deprecated: true` set if provider has announced retirement (or removed from active list)

### Pricing (per 1M tokens, USD)
- [ ] `pricing.input` matches provider pricing page
- [ ] `pricing.output` matches provider pricing page
- [ ] `pricing.cachedInput` matches provider's documented cached/prompt-cache rate (or is correctly omitted if no caching offered)
- [ ] `pricing.updatedAt` is recent — warn if older than 60 days

### Context & output limits
- [ ] `contextWindow` matches docs (in tokens)
- [ ] `capabilities.maxOutputTokens` matches documented output cap (or is correctly omitted if "no output limit")

### Capabilities (each must be DOCUMENTED-AS-SUPPORTED **and** CONSUMED-BY-PROVIDER-CODE)
- [ ] `temperature` — provider accepts it for this model (reasoning-always-on models often reject)
- [ ] `reasoningEffort.values` — list matches docs; **omitted** for always-reasoning models that reject the parameter (e.g., grok-4.3, where xAI docs explicitly state `reasoning_effort` is not supported). Verify per model — some always-reasoning models (e.g., OpenAI's o-series) DO accept `reasoning_effort` and should keep the flag.
- [ ] `verbosity.values` — only on OpenAI gpt-5.x family; values match docs
- [ ] `thinking.levels` + `thinking.default` — only on Anthropic/Gemini; values match docs
- [ ] `nativeStructuredOutputs` — only on anthropic/fireworks/openrouter; provider must document Structured Outputs / JSON-mode for this model
- [ ] `toolUsageControl` — provider supports `tool_choice` semantics
- [ ] `computerUse` — provider implements computer-use loop AND model is a computer-use SKU
- [ ] `deepResearch` — only on actual deep-research SKUs
- [ ] `memory: false` — only when the model genuinely cannot maintain conversation history

### Flags
- [ ] `recommended: true` — at most one or two per provider; should be current flagship
- [ ] `speedOptimized: true` — only on smallest/fastest tier (nano / flash-lite / haiku class)
## Step 5: Report (mandatory format)

For each model, emit a table with one row per checklist item. Every row that claims ✓ must have a URL.

```markdown
### Validation — <model-id>

| Field | Repo | Live docs | Source URL | Status |
|---|---|---|---|---|
| `input` | $1.25/M | $1.25/M | https://docs.x.ai/... | ✓ |
| `cachedInput` | $0.50/M | $0.20/M | https://cloudprice.net/... | ✗ stale (price cut not picked up) |
| `reasoningEffort` | low/medium/high | rejected by API | https://docs.x.ai/.../reasoning | ✗ inert — selecting silently no-ops |
| `contextWindow` | 1,000,000 | 1,000,000 | https://docs.x.ai/... + https://openrouter.ai/... | ✓ (2 sources) |
| `releaseDate` | 2026-04-30 | not found in scraped pages | _attempted: docs.x.ai, x.ai/news_ | ❓ UNVERIFIED |

**Findings**
- 🔴 critical — `cachedInput` is wrong: docs say $0.20/M, repo has $0.50/M
- 🟡 warning — `reasoningEffort` is set but provider rejects it for this model (xAI docs explicitly: "reasoning_effort is not supported by grok-4.3")
- 🔵 suggestion — `pricing.updatedAt` is 90 days old; refresh
- ❓ unverified — `releaseDate` could not be confirmed from any fetched page; ask user

**Disagreements between sources**
- _none_ OR _OpenRouter says $X, provider docs say $Y — went with provider docs_
```

End each multi-model run with a summary count: `N models checked · X critical · Y warnings · Z suggestions · W unverified`.
## Step 6: Offer to fix

After reporting, ask: *"Want me to fix the critical and warning items? I'll print the diff first."* On yes:

1. Print the proposed diff (do not apply yet)
2. Get user confirmation
3. Edit `models.ts` in a single pass
4. Run `bun run lint`
5. Re-run only the failed rows of the checklist on the new state
## Severity definitions

- 🔴 **critical** — wrong number or wrong identifier that misleads users about cost or breaks API calls. Examples: incorrect pricing, wrong model id, wrong context window, capability the API rejects.
- 🟡 **warning** — dead code or internal inconsistency. Examples: capability flag the provider ignores, multiple `recommended: true` per provider, `pricing.updatedAt` >60 days old, missing `deprecated: true` on retired model.
- 🔵 **suggestion** — style/consistency. Examples: field order, missing `speedOptimized` on a clearly smallest-tier model.
- ❓ **unverified** — could not fetch an authoritative source for this field. Surface it; never silently confirm.
## Common bugs this skill catches

- Pricing drift after a provider price cut (very common — providers cut quarterly)
- `reasoningEffort` set on always-reasoning models that reject the parameter (grok-4.3, o3-pro pattern)
- `nativeStructuredOutputs` set on providers that don't consume the flag (dead)
- `thinking` set on non-Anthropic/non-Gemini providers
- `verbosity` set on non-gpt-5.x models
- Wrong context window (e.g., 128k claimed vs 200k actual)
- Stale `pricing.updatedAt`
- Multiple `recommended: true` per provider after a flagship swap
- Missing `deprecated: true` on retired models (e.g., the xAI batch retiring May 15, 2026)
## What "I cannot verify this" looks like

If, after fetching the documented sources, a field cannot be confirmed:

- Mark the row ❓ UNVERIFIED with the URL(s) attempted
- Surface it in the **Findings** section with severity ❓
- Do NOT mark the validation as passed
- Ask the user for a docs URL or guidance before changing anything

The skill is allowed to say *"I could not verify the cached input price for grok-4.3 from the official xAI docs in this session — I attempted [URLs] without finding the value. Third-party sources [URL1, URL2] both report $0.20/M. Confirm before I update."* That is correct behavior. Hallucinating a number is not.
