Skip to content

Commit 9aff383

Browse files
committed
chore(skills): add /add-model and /validate-model commands
1 parent a945399 commit 9aff383

2 files changed

Lines changed: 334 additions & 0 deletions

File tree

.claude/commands/add-model.md

Lines changed: 157 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,157 @@
1+
---
2+
description: Add a new LLM model to apps/sim/providers/models.ts with specs verified against the provider's live API docs (no hallucination)
3+
argument-hint: <provider> <model-id> [docs-url]
4+
---
5+
6+
# Add Model Skill
7+
8+
You add a new model entry to `apps/sim/providers/models.ts`. **Every numeric and capability claim MUST be derived from a live web fetch of the provider's official docs in this session.** Marketing emails, training data, and your prior knowledge are not sources of truth — they routinely hallucinate pricing, context windows, and capability lists.
9+
10+
## Hard rules (do not skip)
11+
12+
1. **Live-fetch or refuse.** Before writing the entry, you must successfully WebFetch the provider's official models/pricing page in this session. If you cannot reach an authoritative source for any field, **mark the field as UNVERIFIED in your report and ask the user before guessing**. Never fill in pricing or capabilities from memory.
13+
2. **Two-source rule for pricing.** Cross-check input/output/cached pricing against at least one secondary source (OpenRouter, Artificial Analysis, CloudPrice, mem0, intuitionlabs). If sources disagree, the provider's own docs win — but flag the disagreement.
14+
3. **Read the code before setting capability flags.** Capability flags are dead unless the provider's implementation under `apps/sim/providers/{provider}/` actually consumes them (see Consumption Matrix below). Setting a flag the provider ignores is a silent bug.
15+
4. **Cite every fact.** Your final report must list the URL each value came from. No URL → not verified.
16+
17+
## Your Task
18+
19+
1. Identify provider and model id from user args
20+
2. Live-fetch official docs + pricing page + capability/parameter pages + at least one secondary source
21+
3. Apply the Consumption Matrix to know which capability flags are real
22+
4. Read 2-3 sibling entries in `models.ts` and match their pattern exactly
23+
5. Insert the entry, run `bun run lint`, print the verification report
24+
25+
## Step 1: Live source-of-truth lookup
26+
27+
In priority order — fetch all that exist for the provider:
28+
29+
| Provider | Models index | Pricing | Reasoning/parameter caveats |
30+
|---|---|---|---|
31+
| OpenAI | platform.openai.com/docs/models | openai.com/api/pricing | platform.openai.com/docs/guides/reasoning |
32+
| Anthropic | docs.anthropic.com/en/docs/about-claude/models | anthropic.com/pricing | docs.anthropic.com/en/docs/build-with-claude/extended-thinking |
33+
| Google (Gemini) | ai.google.dev/gemini-api/docs/models | ai.google.dev/pricing | ai.google.dev/gemini-api/docs/thinking |
34+
| xAI | docs.x.ai/developers/models | docs.x.ai/developers/models (per-model detail page) | docs.x.ai/developers/model-capabilities/text/reasoning |
35+
| Mistral | docs.mistral.ai/getting-started/models/models_overview | mistral.ai/pricing | n/a |
36+
| DeepSeek | api-docs.deepseek.com/quick_start/pricing | same | api-docs.deepseek.com/guides/reasoning_model |
37+
| Groq | console.groq.com/docs/models | groq.com/pricing | n/a |
38+
| Cerebras | inference-docs.cerebras.ai/models | cerebras.ai/pricing | n/a |
39+
40+
Secondary verification (use at least one): `openrouter.ai/<provider>/<model>`, `artificialanalysis.ai/models/<model>`, `cloudprice.net/models/<provider>-<model>`.
41+
42+
Use a precise WebFetch prompt: *"Extract for {model_id}: exact model id string, context window in tokens, input price per 1M, cached input price per 1M, output price per 1M, max output tokens, supported reasoning effort levels, accepted parameters (temperature, top_p), release date. Do not fill in fields you cannot find."*
43+
44+
## Step 2: Consumption Matrix (which provider honors which capability)
45+
46+
| Capability | Honored by | Effect if set elsewhere |
47+
|---|---|---|
48+
| `temperature` | All providers (passed through if set) | Safe but inert on always-reasoning models that reject it |
49+
| `toolUsageControl` | All providers (provider-level, not per-model) | n/a — set on `ProviderDefinition`, not models |
50+
| `reasoningEffort` | `openai/core.ts`, `azure-openai`, `anthropic/core.ts` (mapped to thinking), `gemini/core.ts` | **Dead on xai, deepseek, mistral, groq, cerebras, openrouter, fireworks, bedrock, vertex** unless their core consumes it — re-grep before assuming |
51+
| `verbosity` | `openai/core.ts`, `azure-openai/index.ts` only | Dead elsewhere |
52+
| `thinking` | `anthropic/core.ts`, `gemini/core.ts` | Dead elsewhere |
53+
| `nativeStructuredOutputs` | `anthropic/core.ts`, `fireworks/index.ts`, `openrouter/index.ts` | Dead on openai, xai, google, vertex, bedrock, azure-openai, deepseek, mistral, groq, cerebras |
54+
| `maxOutputTokens` | Read by UI + executor for token estimation | Always meaningful — set if provider documents a cap |
55+
| `computerUse` | `anthropic/core.ts` | Dead elsewhere |
56+
| `deepResearch` | UI flag for routing to deep-research SKUs | Set only on actual deep-research model IDs |
57+
| `memory: false` | Conversation persistence opt-out | Set only when model genuinely cannot maintain history (e.g., deep-research) |
58+
59+
**Always re-grep before relying on this table** — the codebase moves:
60+
61+
```bash
62+
rg "reasoningEffort|reasoning_effort" apps/sim/providers/<provider>/
63+
rg "verbosity" apps/sim/providers/<provider>/
64+
rg "request\.thinking|thinking:" apps/sim/providers/<provider>/
65+
rg "supportsNativeStructuredOutputs|nativeStructuredOutputs" apps/sim/providers/<provider>/
66+
```
67+
68+
## Step 3: Match the provider's existing entry pattern
69+
70+
Open `apps/sim/providers/models.ts`, find `PROVIDER_DEFINITIONS[<provider>].models`, read 2-3 sibling entries. Match field order exactly:
71+
72+
```ts
73+
{
74+
id: '<exact-api-id>',
75+
pricing: {
76+
input: <number>,
77+
cachedInput: <number>, // omit if provider doesn't offer caching
78+
output: <number>,
79+
updatedAt: '<today YYYY-MM-DD>',
80+
},
81+
capabilities: {
82+
// only flags the provider actually consumes — see matrix
83+
},
84+
contextWindow: <tokens>,
85+
releaseDate: '<YYYY-MM-DD>',
86+
recommended: true, // only if new flagship; ask user before swapping
87+
speedOptimized: true, // only on smallest/fastest tier
88+
deprecated: true, // only on retired models
89+
}
90+
```
91+
92+
### Reseller providers (azure-openai, azure-anthropic, vertex, bedrock, openrouter)
93+
94+
Model id MUST be prefixed: `azure/`, `azure-anthropic/`, `vertex/`, `bedrock/`, `openrouter/`. Pricing usually mirrors the upstream provider but verify on the reseller's own pricing page.
95+
96+
### Insertion order
97+
98+
Within a family, newest first (matches existing convention: GPT-5.5 above GPT-5.4 above GPT-5.2). Across families, biggest/flagship at top of list.
99+
100+
### `recommended` / `speedOptimized`
101+
102+
- At most one or two `recommended: true` per provider — the current flagship(s).
103+
- If you're adding a new flagship, ask the user before removing `recommended` from the previous flagship. Never silently flip it.
104+
- `speedOptimized: true` only on the smallest/fastest tier (nano, flash-lite, haiku class).
105+
106+
## Step 4: Write, lint
107+
108+
```bash
109+
bun run lint
110+
```
111+
112+
Lint must pass before reporting done.
113+
114+
## Step 5: Verification report (mandatory format)
115+
116+
End with this exact structure:
117+
118+
```markdown
119+
### Verification — <model-id>
120+
121+
| Field | Value | Source URL | Status |
122+
|---|---|---|---|
123+
| `id` | `grok-4.3` | https://docs.x.ai/... | ✓ verified |
124+
| `contextWindow` | 1,000,000 | https://docs.x.ai/... + https://openrouter.ai/... | ✓ verified (2 sources agree) |
125+
| `input` | $1.25/M | https://docs.x.ai/... | ✓ verified |
126+
| `cachedInput` | $0.20/M | https://cloudprice.net/... | ⚠️ single source |
127+
| `output` | $2.50/M | https://docs.x.ai/... + https://openrouter.ai/... | ✓ verified |
128+
| `capabilities.temperature` | `{ min: 0, max: 1 }` | matches sibling entries | — pattern-match only |
129+
| `capabilities.reasoningEffort` | NOT SET | provider docs say API rejects it for this model | ✓ correctly omitted |
130+
| `releaseDate` | 2026-04-30 | https://docs.x.ai/... announcement | ✓ verified |
131+
132+
**Disagreements**
133+
- _none_ OR _OpenRouter says X, provider docs say Y — used Y per provider rule_
134+
135+
**Unverified fields**
136+
- _none_ OR _<field>: could not find authoritative source — left as <X> based on sibling pattern; please confirm_
137+
```
138+
139+
If any row is ⚠️ single-source or "unverified," **state it plainly to the user and ask whether to proceed**. Do not silently merge.
140+
141+
## What to do if you cannot find a source
142+
143+
- Pricing missing → do NOT guess. Omit `cachedInput`. Ask the user for the price or the docs URL.
144+
- Context window missing → do NOT guess. Ask.
145+
- Release date missing → omit the field; mention in the report.
146+
- Capability uncertain → omit the flag (safer than setting a dead/wrong one).
147+
148+
## Anti-patterns this skill exists to prevent
149+
150+
- ❌ Trusting a marketing email (xAI's grok-4.3 email claimed "3 reasoning efforts" but the API rejects `reasoning_effort` — verified by official docs only)
151+
- ❌ Setting `nativeStructuredOutputs: true` on xai/openai/google (dead — only anthropic/fireworks/openrouter consume it)
152+
- ❌ Setting `thinking` on non-Anthropic/non-Gemini providers
153+
- ❌ Setting `verbosity` on anything other than OpenAI gpt-5.x
154+
- ❌ Copying `pricing.updatedAt` from a sibling instead of using today's date
155+
- ❌ Inventing a `cachedInput` price by dividing input by 4 (varies by provider — find an explicit number)
156+
- ❌ Stamping `recommended: true` on the new model without removing it from the previous flagship
157+
- ❌ Reporting "done" with any UNVERIFIED row in the table

.claude/commands/validate-model.md

Lines changed: 177 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,177 @@
1+
---
2+
description: Validate a model entry (or every model in a provider) in apps/sim/providers/models.ts against the provider's live API docs (no hallucination — reports what cannot be verified)
3+
argument-hint: <provider> [model-id]
4+
---
5+
6+
# Validate Model Skill
7+
8+
You audit one or more model entries in `apps/sim/providers/models.ts` against the provider's official live API docs. **Hallucinated pricing and capabilities are the #1 failure mode in this file.** Every numeric and capability claim must be re-derived from a live web fetch in this session — not from memory, not from training data, not from the user's marketing email.
9+
10+
## Hard rules (do not skip)
11+
12+
1. **Live-fetch or report unverified.** Each field must be backed by a live WebFetch in this session. If you cannot reach an authoritative URL for a field, mark it **UNVERIFIED** in the report — do not silently confirm it from memory.
13+
2. **Cite every fact.** Every value in the report must show the source URL it was checked against. No URL → mark UNVERIFIED.
14+
3. **Two-source rule for pricing.** Cross-check input/output/cached against at least one secondary source (OpenRouter, Artificial Analysis, CloudPrice). If sources disagree, the provider's own docs win — flag the disagreement.
15+
4. **Inspect provider implementation before flagging capability mismatches.** A capability flag in `models.ts` is dead unless the provider's code under `apps/sim/providers/{provider}/` consumes it (see Consumption Matrix below). Setting a flag the provider ignores is a warning, not a critical.
16+
5. **Never auto-fix without printing the diff.** Show the user the proposed diff before applying. Get confirmation.
17+
18+
## Your Task
19+
20+
When invoked as `/validate-model <provider> [model-id]`:
21+
22+
1. Read the target entries from `models.ts`
23+
2. Live-fetch the provider's official models, pricing, and capability/reasoning pages + at least one secondary source for pricing
24+
3. Inspect the provider implementation to know which flags are actually consumed
25+
4. Run the checklist below per model
26+
5. Report findings (critical / warning / suggestion / unverified) with every cell linked to its source URL
27+
6. Offer to fix; on confirm, edit `models.ts` in a single pass and re-lint
28+
29+
If `model-id` is omitted, validate every model in the provider.
30+
31+
## Step 1: Read entries from `models.ts`
32+
33+
Capture per model: `id`, full `pricing`, full `capabilities`, `contextWindow`, `releaseDate`, `recommended`, `speedOptimized`, `deprecated`.
34+
35+
## Step 2: Live-fetch authoritative sources
36+
37+
Same provider table as `add-model`. Fetch all relevant pages:
38+
39+
| Provider | Models index | Pricing | Reasoning/parameter caveats |
40+
|---|---|---|---|
41+
| OpenAI | platform.openai.com/docs/models | openai.com/api/pricing | platform.openai.com/docs/guides/reasoning |
42+
| Anthropic | docs.anthropic.com/en/docs/about-claude/models | anthropic.com/pricing | docs.anthropic.com/en/docs/build-with-claude/extended-thinking |
43+
| Google (Gemini) | ai.google.dev/gemini-api/docs/models | ai.google.dev/pricing | ai.google.dev/gemini-api/docs/thinking |
44+
| xAI | docs.x.ai/developers/models | per-model detail page on docs.x.ai | docs.x.ai/developers/model-capabilities/text/reasoning |
45+
| Mistral | docs.mistral.ai/getting-started/models/models_overview | mistral.ai/pricing | n/a |
46+
| DeepSeek | api-docs.deepseek.com/quick_start/pricing | same | api-docs.deepseek.com/guides/reasoning_model |
47+
| Groq | console.groq.com/docs/models | groq.com/pricing | n/a |
48+
| Cerebras | inference-docs.cerebras.ai/models | cerebras.ai/pricing | n/a |
49+
50+
Secondary cross-check (use at least one): OpenRouter, Artificial Analysis, CloudPrice.
51+
52+
If a fetch fails (404, timeout, paywall), record the URL attempted and mark dependent fields UNVERIFIED.
53+
54+
## Step 3: Build the consumption map for this provider
55+
56+
Re-grep before trusting the snapshot below:
57+
58+
```bash
59+
rg "reasoningEffort|reasoning_effort" apps/sim/providers/<provider>/
60+
rg "verbosity" apps/sim/providers/<provider>/
61+
rg "request\.thinking|thinking:" apps/sim/providers/<provider>/
62+
rg "supportsNativeStructuredOutputs|nativeStructuredOutputs" apps/sim/providers/<provider>/
63+
```
64+
65+
Snapshot (verify before relying):
66+
67+
| Capability | Consumed by |
68+
|---|---|
69+
| `reasoningEffort` | `openai/core.ts`, `azure-openai`, `anthropic/core.ts` (mapped via thinking), `gemini/core.ts` |
70+
| `verbosity` | `openai/core.ts`, `azure-openai/index.ts` |
71+
| `thinking` | `anthropic/core.ts`, `gemini/core.ts` |
72+
| `nativeStructuredOutputs` | `anthropic/core.ts`, `fireworks/index.ts`, `openrouter/index.ts` |
73+
| `computerUse` | `anthropic/core.ts` |
74+
| `temperature` | All providers (passthrough) |
75+
76+
A flag set in `models.ts` but not in the consumption list for this provider = **warning: dead flag**.
77+
78+
## Step 4: Run the checklist
79+
80+
For each model, evaluate every row. Statuses: ✓ matches docs, ✗ disagrees, ⚠️ single-source, ❓ UNVERIFIED (could not fetch).
81+
82+
### Identity
83+
- [ ] `id` exactly matches provider's API model identifier (case, dots, dashes, prefix for resellers)
84+
- [ ] `releaseDate` matches launch announcement
85+
- [ ] `deprecated: true` set if provider has announced retirement (or removed from active list)
86+
87+
### Pricing (per 1M tokens, USD)
88+
- [ ] `pricing.input` matches provider pricing page
89+
- [ ] `pricing.output` matches provider pricing page
90+
- [ ] `pricing.cachedInput` matches provider's documented cached/prompt-cache rate (or is correctly omitted if no caching offered)
91+
- [ ] `pricing.updatedAt` is recent — warn if older than 60 days
92+
93+
### Context & output limits
94+
- [ ] `contextWindow` matches docs (in tokens)
95+
- [ ] `capabilities.maxOutputTokens` matches documented output cap (or is correctly omitted if "no output limit")
96+
97+
### Capabilities (each must be DOCUMENTED-AS-SUPPORTED **and** CONSUMED-BY-PROVIDER-CODE)
98+
- [ ] `temperature` — provider accepts it for this model (reasoning-always-on models often reject)
99+
- [ ] `reasoningEffort.values` — list matches docs; **omitted** for always-reasoning models that reject the parameter (grok-4.3, o3-pro pattern)
100+
- [ ] `verbosity.values` — only on OpenAI gpt-5.x family; values match docs
101+
- [ ] `thinking.levels` + `thinking.default` — only on Anthropic/Gemini; values match docs
102+
- [ ] `nativeStructuredOutputs` — only on anthropic/fireworks/openrouter; provider must document Structured Outputs / JSON-mode for this model
103+
- [ ] `toolUsageControl` — provider supports `tool_choice` semantics
104+
- [ ] `computerUse` — provider implements computer-use loop AND model is a computer-use SKU
105+
- [ ] `deepResearch` — only on actual deep-research SKUs
106+
- [ ] `memory: false` — only when the model genuinely cannot maintain conversation history
107+
108+
### Flags
109+
- [ ] `recommended: true` — at most one or two per provider; should be current flagship
110+
- [ ] `speedOptimized: true` — only on smallest/fastest tier (nano / flash-lite / haiku class)
111+
112+
## Step 5: Report (mandatory format)
113+
114+
For each model, emit a table with one row per checklist item. Every row that claims ✓ must have a URL.
115+
116+
```markdown
117+
### Validation — <model-id>
118+
119+
| Field | Repo | Live docs | Source URL | Status |
120+
|---|---|---|---|---|
121+
| `input` | $1.25/M | $1.25/M | https://docs.x.ai/... ||
122+
| `cachedInput` | $0.50/M | $0.20/M | https://cloudprice.net/... | ✗ stale (price cut not picked up) |
123+
| `reasoningEffort` | low/medium/high | rejected by API | https://docs.x.ai/.../reasoning | ✗ inert — selecting silently no-ops |
124+
| `contextWindow` | 1,000,000 | 1,000,000 | https://docs.x.ai/... + https://openrouter.ai/... | ✓ (2 sources) |
125+
| `releaseDate` | 2026-04-30 | not found in scraped pages | _attempted: docs.x.ai, x.ai/news_ | ❓ UNVERIFIED |
126+
127+
**Findings**
128+
- 🔴 critical — `cachedInput` is wrong: docs say $0.20/M, repo has $0.50/M
129+
- 🟡 warning — `reasoningEffort` is set but provider rejects it for this model (xAI docs explicitly: "reasoning_effort is not supported by grok-4.3")
130+
- 🔵 suggestion — `pricing.updatedAt` is 90 days old; refresh
131+
- ❓ unverified — `releaseDate` could not be confirmed from any fetched page; ask user
132+
133+
**Disagreements between sources**
134+
- _none_ OR _OpenRouter says $X, provider docs say $Y — went with provider docs_
135+
```
136+
137+
End each multi-model run with a summary count: `N models checked · X critical · Y warnings · Z suggestions · W unverified`.
138+
139+
## Step 6: Offer to fix
140+
141+
After reporting, ask: *"Want me to fix the critical and warning items? I'll print the diff first."* On yes:
142+
143+
1. Print the proposed diff (do not apply yet)
144+
2. Get user confirmation
145+
3. Edit `models.ts` in a single pass
146+
4. Run `bun run lint`
147+
5. Re-run only the failed rows of the checklist on the new state
148+
149+
## Severity definitions
150+
151+
- 🔴 **critical** — wrong number or wrong identifier that misleads users about cost or breaks API calls. Examples: incorrect pricing, wrong model id, wrong context window, capability the API rejects.
152+
- 🟡 **warning** — dead code or internal inconsistency. Examples: capability flag the provider ignores, multiple `recommended: true` per provider, `pricing.updatedAt` >60 days old, missing `deprecated: true` on retired model.
153+
- 🔵 **suggestion** — style/consistency. Examples: field order, missing `speedOptimized` on a clearly smallest-tier model.
154+
-**unverified** — could not fetch an authoritative source for this field. Surface it; never silently confirm.
155+
156+
## Common bugs this skill catches
157+
158+
- Pricing drift after a provider price cut (very common — providers cut quarterly)
159+
- `reasoningEffort` set on always-reasoning models that reject the parameter (grok-4.3, o3-pro pattern)
160+
- `nativeStructuredOutputs` set on providers that don't consume the flag (dead)
161+
- `thinking` set on non-Anthropic/non-Gemini providers
162+
- `verbosity` set on non-gpt-5.x models
163+
- Wrong context window (e.g., 128k claimed vs 200k actual)
164+
- Stale `pricing.updatedAt`
165+
- Multiple `recommended: true` per provider after a flagship swap
166+
- Missing `deprecated: true` on retired models (e.g., the xAI batch retiring May 15, 2026)
167+
168+
## What "I cannot verify this" looks like
169+
170+
If, after fetching the documented sources, a field cannot be confirmed:
171+
172+
- Mark the row ❓ UNVERIFIED with the URL(s) attempted
173+
- Surface it in the **Findings** section with severity ❓
174+
- Do NOT mark the validation as passed
175+
- Ask the user for a docs URL or guidance before changing anything
176+
177+
The skill is allowed to say *"I could not verify the cached input price for grok-4.3 from the official xAI docs in this session — I attempted [URLs] without finding the value. Third-party sources [URL1, URL2] both report $0.20/M. Confirm before I update."* That is correct behavior. Hallucinating a number is not.

0 commit comments

Comments
 (0)