API Arbitration
How we resolve API design disputes with data instead of opinion. This is the operational process for choosing between API shapes when the right answer isn't obvious.
Not every API decision needs arbitration. Use this process when:
- You're choosing between 2–3 viable API shapes and can argue for any of them
- Naming isn't obvious (it rarely is)
- The "right" answer depends on how someone will actually reach for it in practice
- Two people disagree and both have reasonable arguments
- A spec review surfaces uncertainty that can't be resolved by checking conventions alone
Don't use this for:
- Decisions already covered by API Conventions (just follow the convention)
- Bug fixes or implementation details that don't change the public API
- Cases where only one option actually works (no arbitration needed — ship it)
Four phases, in order. Don't skip phases — each one feeds the next.
Enumerate the realistic design space. Don't anchor on two options prematurely — consider the full range:
| Dimension | Options to explore |
|---|---|
| Abstraction level | Hook vs prop vs wrapper component vs context provider |
| Naming | What would a naive developer search for? What mental models does each name trigger? |
| Configuration | Plain boolean vs boolean \| config object vs separate props vs children-based |
| Ownership | State on parent (hoisted) vs state on child (local) vs shared context |
| Composition | Slot on parent vs standalone child vs render prop vs compound components |
| Granularity | One component with modes vs separate components per mode |
For each viable option, write out what consumer code looks like — not the implementation, the usage. Write at least 3 realistic usage snippets. If you can't write 3, the option probably isn't viable.
Output: 2–4 candidate API shapes, each with consumer code examples.
Before testing, exhaustively list the scenarios the API must handle. This is the step that prevents you from only testing the happy path and declaring victory.
Every API arbitration should cover at minimum:
| Case | What it tests |
|---|---|
| Simple/default | The 80% use case. Zero config. Does it just work? |
| Configured | Custom options beyond defaults. Does configuration feel natural? |
| Controlled | External state management. Can the consumer own the state? |
| Composed | Inside Dialog, Table, AppShell, Card. Does it play well with siblings? |
| Edge/mixed | The scenario that reveals friction — mixed modes, dynamic switching, responsive changes |
| Migration | What does adopting this look like for someone with nothing today? |
Add domain-specific cases as needed. For resize: "two resizable panels side by side." For a selector: "1000 items with search." For navigation: "mobile drawer vs desktop sidebar from the same source."
Output: A numbered list of 5–10 specific scenarios to test against.
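A quick way to catch a happy-path-only list is to tag each scenario with the case it covers and check the tags against the minimum set. The sketch below assumes a hypothetical `Scenario` shape and `missingKinds` helper; neither is part of any existing tooling, and the draft list is illustrative.

```ts
// The six minimum cases from the table above.
type CaseKind =
  | "simple" | "configured" | "controlled" | "composed" | "edge" | "migration";

const REQUIRED_KINDS: CaseKind[] = [
  "simple", "configured", "controlled", "composed", "edge", "migration",
];

// Hypothetical shape: "domain" marks a domain-specific extra case.
interface Scenario {
  kind: CaseKind | "domain";
  description: string;
}

// Returns the minimum cases that no scenario covers yet.
function missingKinds(scenarios: Scenario[]): CaseKind[] {
  return REQUIRED_KINDS.filter(
    (kind) => !scenarios.some((s) => s.kind === kind),
  );
}

// Illustrative draft list for a resize feature.
const draft: Scenario[] = [
  { kind: "simple", description: "Drag the sidebar edge, default constraints" },
  { kind: "configured", description: "Min/max bounds with width persistence" },
  { kind: "composed", description: "Sidebar inside AppShell with a header" },
  { kind: "domain", description: "Two resizable panels side by side" },
];

const gaps = missingKinds(draft); // ["controlled", "edge", "migration"]
```

An empty result means the list meets the minimum; it says nothing about whether the scenarios themselves are specific enough.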
This is where the vibe test methodology applies. The core constraint: prompts describe desired UX, never name components or props.
| ❌ Bad (names the solution) | ✅ Good (describes the experience) |
|---|---|
| "Make the sidebar resizable using useResizable" | "Build a layout where the user can drag the sidebar edge to make it wider or narrower" |
| "Add a collapsible prop to the card" | "Build a FAQ page where each question can be expanded to show its answer" |
| "Use Selector with async loading" | "Build a people picker that searches a remote API as the user types" |
Each prompt should map to one of your Phase 2 use cases. Write one prompt per case minimum.
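The "never name components or props" constraint can be checked mechanically before a prompt is used. The following is a minimal sketch, assuming you maintain a list of identifiers from the candidate APIs; `leakedIdentifiers` and the `apiNames` list are hypothetical names for illustration.

```ts
// Escape regex metacharacters so identifiers are matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns every API identifier that appears as a whole word in the prompt.
// Matching is case-sensitive, so the English word "resizable" does not
// trip a check for the component name "Resizable".
function leakedIdentifiers(prompt: string, identifiers: string[]): string[] {
  return identifiers.filter((id) =>
    new RegExp("\\b" + escapeRegExp(id) + "\\b").test(prompt),
  );
}

// Illustrative list: names taken from the candidate APIs under test.
const apiNames = ["useResizable", "Resizable", "Selector", "collapsible"];

const bad = leakedIdentifiers(
  "Make the sidebar resizable using useResizable",
  apiNames,
); // ["useResizable"]

const good = leakedIdentifiers(
  "Build a layout where the user can drag the sidebar edge to make it wider or narrower",
  apiNames,
); // []
```

Prop names that are also ordinary English words (such as `collapsible`) will produce false positives, so treat a hit as a cue to reread the prompt rather than as an automatic rejection.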
For each (prompt × API option):
- Write a minimal skill doc for that option (200–400 words). Document ONLY that API shape: component name, props with types, 2 usage examples, one anti-pattern. Keep all docs the same length and structure. The only variable is the API itself.
- Generate code as if the skill doc is the only reference. The generator must not know about the other options. See Vibe Evaluation#Sub-Agent Isolation for how to enforce this with sub-agents. If running single-model (acknowledged contamination), note where cross-knowledge may have influenced results.
- Collect the output: the generated code, what the agent reached for, where it hesitated, what it hallucinated.
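Because the skill docs are meant to differ only in the API they describe, it helps to check their lengths before generating anything. The sketch below checks the 200–400 word range and flags docs whose lengths drift apart. The `maxSpread` tolerance of 50 words is an assumption, not a documented threshold, and `SkillDoc` is a hypothetical shape.

```ts
// Hypothetical shape for one option's skill doc.
interface SkillDoc {
  option: string;
  body: string;
}

// Counts whitespace-separated words; an empty doc has zero words.
function wordCount(text: string): number {
  const trimmed = text.trim();
  return trimmed.length === 0 ? 0 : trimmed.split(/\s+/).length;
}

// Returns human-readable issues; an empty array means the docs pass.
// maxSpread (assumed value) is the largest allowed gap, in words,
// between the longest and shortest doc.
function skillDocIssues(docs: SkillDoc[], maxSpread: number = 50): string[] {
  const issues: string[] = [];
  const counts = docs.map((doc) => wordCount(doc.body));

  docs.forEach((doc, i) => {
    if (counts[i] < 200 || counts[i] > 400) {
      issues.push(`${doc.option}: ${counts[i]} words (expected 200-400)`);
    }
  });

  if (counts.length > 1) {
    const spread =
      Math.max.apply(null, counts) - Math.min.apply(null, counts);
    if (spread > maxSpread) {
      issues.push(`docs differ by ${spread} words (allowed ${maxSpread})`);
    }
  }
  return issues;
}
```

Word count is only a proxy for "same length and structure"; the section layout (props, two examples, one anti-pattern) still needs a manual read.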
| Signal | What it means |
|---|---|
| Agent finds the API immediately | Good discoverability — the name matches the mental model |
| Agent hallucinates props that don't exist | The name triggered associations with another library's API |
| Agent transforms option A into option B | Option B is how people actually think about the problem |
| Agent adds wrapper divs or extra state | The API has a composition gap |
| Agent uses the API correctly but code is verbose | The abstraction level might be wrong |
| Agent ignores a feature entirely | The feature isn't discoverable from the docs |
| All options produce identical code | The difference doesn't matter — pick the simpler one |
Output: Raw results for each (prompt × option) — generated code, observations, escape hatches used.
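Hallucinated props are the easiest of these signals to count. As a rough sketch, assuming each run's used props have already been collected by a reviewer or an AST pass, the comparison against the skill doc is a set difference. `RunResult` and both helper names are hypothetical.

```ts
// Hypothetical record of one (prompt × option) run.
interface RunResult {
  promptId: string;
  option: string;
  propsUsed: string[]; // gathered by a reviewer or an AST pass, not by this code
}

// Props the agent used that the option's skill doc never documented.
function hallucinatedProps(result: RunResult, documented: string[]): string[] {
  return result.propsUsed.filter((prop) => documented.indexOf(prop) === -1);
}

// Total hallucinated props per option across all runs.
function hallucinationsByOption(
  results: RunResult[],
  documentedByOption: Record<string, string[]>,
): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const result of results) {
    const documented = documentedByOption[result.option] || [];
    totals[result.option] =
      (totals[result.option] || 0) +
      hallucinatedProps(result, documented).length;
  }
  return totals;
}
```

The other signals in the table (hesitation, wrapper divs, verbosity) still need a human or judge agent reading the generated code.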
The first round rarely produces a clear winner. This phase is iterative.
| Observation | Action |
|---|---|
| One option wins on 4/5 prompts, ties on 1 | Ship the winner. The tie doesn't matter. |
| Options split — each wins on different cases | The abstraction is wrong. Neither is the answer. Revisit Phase 1. |
| One option triggers consistent hallucinations | The naming conflicts with prior art (Radix, shadcn, MUI). Rename or restructure. |
| LLM transforms one option into the other | The "losing" option is how people think. The "winning" option is how the system works. Consider shipping the mental-model version. |
| Neither option handles the edge case | The edge case reveals a gap. Add it to the spec. May need a new approach entirely. |
| Results are inconclusive / too close to call | Tie-break on: fewer props > fewer concepts > fewer characters > matches existing Astryx patterns. |
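The tie-break order in the last row is a lexicographic comparison: a later criterion only matters when every earlier one is equal. A minimal sketch follows, assuming each option has been measured on the same scenario; the `OptionMetrics` fields and how you count "concepts" are assumptions for illustration.

```ts
// Hypothetical measurements for one option, taken on the same scenario.
interface OptionMetrics {
  name: string;
  propCount: number;
  conceptCount: number;
  characterCount: number; // length of the consumer code
  matchesExistingPatterns: boolean;
}

// Negative result means `a` ranks ahead of `b`.
// Order: fewer props > fewer concepts > fewer characters > matches patterns.
function tieBreak(a: OptionMetrics, b: OptionMetrics): number {
  return (
    a.propCount - b.propCount ||
    a.conceptCount - b.conceptCount ||
    a.characterCount - b.characterCount ||
    Number(b.matchesExistingPatterns) - Number(a.matchesExistingPatterns)
  );
}

// Returns the preferred option without mutating the input array.
function pickByTieBreak(options: OptionMetrics[]): OptionMetrics {
  if (options.length === 0) {
    throw new Error("no options to compare");
  }
  return options.slice().sort(tieBreak)[0];
}
```

For example, an option with equal props but one fewer concept wins even if its consumer code is longer, because character count is only consulted when props and concepts are both tied.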
If no clear winner emerges:
- Narrow the question. Maybe the difference only matters for one specific case. Test that case in isolation with tighter prompts.
- Blend approaches. The winner on simple cases + the winner on advanced cases might be combinable (e.g., prop for simple, hook for advanced).
- Challenge the premise. If three rounds of testing don't resolve it, the component might be trying to do too much. Consider splitting.
Output: A decision with evidence. Document which option won, on which prompts, and why. This becomes part of the spec.
Copy this template into your AI assistant to run an API arbitration. Fill in the bracketed sections.
# API Arbitration: [Component/Feature Name]
## Context
I'm designing the API for [brief description — what the feature does and
why it exists]. We need to choose between [N] candidate approaches.
## Candidate APIs
### Option A: [Short name — e.g. "wrapper component"]
```tsx
// Simple case
[consumer code]
// Configured case
[consumer code]
// Controlled case
[consumer code]
```
### Option B: [Short name — e.g. "hook with spread"]
```tsx
// Simple case
[consumer code]
// Configured case
[consumer code]
// Controlled case
[consumer code]
```
### Option C: [Short name — e.g. "boolean-or-config prop"] (if applicable)
```tsx
[same structure]
```
## Use Cases
1. [Simple/default — describe the 80% scenario]
2. [Configured — describe customization beyond defaults]
3. [Controlled — describe external state ownership]
4. [Composed — describe usage inside a parent like AppShell or Dialog]
5. [Edge case — describe the scenario that reveals friction]
## Instructions
For each candidate API, do the following:
**Step 1: Write a skill doc** (200-400 words) that documents ONLY that
option. Include:
- Component/hook name and import path
- Props/parameters with types and defaults
- 2 usage examples (simple + configured)
- 1 anti-pattern ("don't do this because...")
Keep all docs the same length. The only variable is the API shape.
**Step 2: For each use case, write a naive prompt** that describes the
desired UX without naming any components or props. These prompts must be
identical across all options.
**Step 3: Generate code** for each (prompt × option) as if the skill doc
is your only reference. For each generation:
- Note what you reached for first
- Note where you hesitated or re-read the doc
- Note any props/components you wanted to use but couldn't find
- Flag if prior knowledge of another option influenced you
**Step 4: Evaluate** each result:
- Hallucinations (props/components that don't exist in the skill doc)
- Lines of code and boilerplate ratio
- Escape hatches (dropping out of the system to raw CSS/HTML)
- Would the code survive a new use case without rewriting?
- Does someone unfamiliar with the API understand the code on first read?
**Step 5: Synthesize**
- Which option had the lowest friction across all prompts?
- Which triggered the most hallucinations?
- Did any prompt reveal a fundamental limitation?
- Recommendation with specific evidence from the results.
## Contamination Note
If running this as a single agent (not isolated sub-agents), results are
biased by cross-knowledge between approaches. Relative signal is still
useful — but for high-stakes decisions, re-run with isolated agents per
[[Vibe Evaluation#Sub-Agent Isolation]].

This is a real decision from Astryx development that followed this process.
| Option | Shape | Consumer code |
|---|---|---|
| A: Wrapper | Component wraps target | `<Resizable defaultWidth={260}><SideNav /></Resizable>` |
| B: Hook | Returns props to spread | `<SideNav {...useResizable({ defaultWidth: 260 })} />` |
| C: Prop | Config on the target | `<SideNav resizable={{ defaultWidth: 260, onWidthChange }} />` |
- Simple sidebar resize (drag to widen/narrow, default constraints)
- Resize with min/max bounds and width persistence
- Resize with collapse — drag past minimum triggers collapse animation
- Resize in a constrained layout (AppShell header + sidebar)
- Two resizable panels sharing available space
Prompts were written for each case. Key findings:
Option A (Wrapper):
- Added an extra DOM node in every case — unavoidable structural cost
- Two-panel case required nested wrappers with confusing ordering
- LLMs understood the API immediately (familiar pattern from libraries like react-resizable)
- But generated code had wrapper-ordering bugs in 2/5 cases
Option B (Hook):
- Composed cleanly with no extra DOM
- Required understanding "spread props": some LLMs produced `{...resize}` in the wrong position
- Handle placement was ambiguous: does the hook add the handle, or does the consumer?
- Lowest boilerplate for the simple case
Option C (Prop):
- Most discoverable: LLMs found `resizable` as a prop without any hesitation
- Config object felt natural for the configured/controlled cases
- Limitation: only works on components that explicitly accept the prop
- Cleanest generated code across all 5 prompts
No single option won all cases. The resolution:
- Ship C (prop) for `SideNav`: the most common resize target, and the prop covers all real scenarios that came up
- Ship B (hook) as the general primitive: any element can be made resizable
- Drop A (wrapper) — DOM bloat for zero benefit in any tested scenario
The prop version calls the hook internally. One implementation, two API surfaces matching different use cases. Documented in API Conventions#Behaviors: Hooks Over Wrappers.
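The "one implementation, two surfaces" arrangement can be sketched without any framework code. This is not the Astryx implementation: `createResizable` stands in for the hook, `createSideNav` stands in for the component, and `minWidth`/`maxWidth` are assumed fields (only `defaultWidth` and `onWidthChange` appear in the consumer code above). A real hook would hold the width in React state instead of a closure variable.

```ts
// Hypothetical config; minWidth and maxWidth are assumed for illustration.
interface ResizableConfig {
  defaultWidth: number;
  minWidth?: number;
  maxWidth?: number;
  onWidthChange?: (width: number) => void;
}

interface ResizableHandle {
  getWidth(): number;
  dragBy(deltaPx: number): void;
}

// Stand-in for the hook (surface B): the single implementation.
function createResizable(config: ResizableConfig): ResizableHandle {
  const min = config.minWidth ?? 0;
  const max = config.maxWidth ?? Number.POSITIVE_INFINITY;
  const clamp = (w: number): number => Math.min(max, Math.max(min, w));
  let width = clamp(config.defaultWidth);

  return {
    getWidth: (): number => width,
    dragBy(deltaPx: number): void {
      const next = clamp(width + deltaPx);
      if (next === width) return; // clamped to the same value: nothing to report
      width = next;
      if (config.onWidthChange) config.onWidthChange(width);
    },
  };
}

// Stand-in for SideNav (surface C): the prop only forwards to the primitive.
function createSideNav(props: { resizable?: ResizableConfig }) {
  return {
    resize: props.resizable ? createResizable(props.resizable) : null,
  };
}
```

Because the prop path adds no logic of its own, any fix to clamping or change notification lands in both surfaces at once.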
Key insight: The prop was validated by stress-testing all use cases against it. If any case had required the hook's flexibility, we would have shipped hook-only. The prop exists because the common cases didn't need that flexibility — not because props are inherently better than hooks.
For high-confidence results (new components, contentious decisions), use isolated sub-agents instead of single-model evaluation. See Vibe Evaluation#Sub-Agent Isolation for the full methodology and contamination risks.
The short version:
- One agent per (prompt × option) — 5 prompts × 3 options = 15 agents
- Each agent gets ONLY its option's skill doc. No knowledge of alternatives.
- Include the blank-slate constraint: "You have NO prior knowledge of any design system."
- After all agents complete, spawn a judge agent that sees all outputs side-by-side and scores comparatively.
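The job plan for the first three bullets is a cross product in which each job's context is built from exactly one skill doc. The sketch below only assembles the job inputs; how an agent is actually spawned is left out, and `AgentJob`, `planJobs`, and the field names are hypothetical.

```ts
// Blank-slate constraint quoted from the list above.
const BLANK_SLATE = "You have NO prior knowledge of any design system.";

interface Prompt {
  id: string;
  text: string;
}

interface OptionDoc {
  option: string;
  body: string; // the skill doc for this option only
}

// Hypothetical input for one isolated agent.
interface AgentJob {
  promptId: string;
  option: string;
  context: string;
}

// One job per (prompt × option). Each context contains the blank-slate
// constraint, a single option's skill doc, and the prompt, nothing else.
function planJobs(prompts: Prompt[], docs: OptionDoc[]): AgentJob[] {
  const jobs: AgentJob[] = [];
  for (const prompt of prompts) {
    for (const doc of docs) {
      jobs.push({
        promptId: prompt.id,
        option: doc.option,
        context: [BLANK_SLATE, doc.body, prompt.text].join("\n\n"),
      });
    }
  }
  return jobs;
}
```

With 5 prompts and 3 option docs this yields 15 jobs. The judge agent is deliberately not part of this plan, since it is the only participant that sees every output.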
Single-model evaluation (one agent doing all options sequentially) is fine for quick directional signal, early exploration, or cases where the decision isn't high-stakes. Just acknowledge the contamination.
Every API arbitration that ships a result should be documented:
- In the spec issue — link to the vibe test results, note which option won and why
- In the API Conventions wiki — if the decision establishes a new convention (like "hooks over wrappers"), add it
- In the component's doc: the `{Name}.doc.mjs` should reference the decision if consumers might wonder "why is it this way?"
This creates a trail. When someone proposes changing a decided API, they can see the evidence behind the current choice — and either accept it or propose a re-test with new evidence.
- API Conventions — The documented conventions that API review checks against
- Vibe Tests — Full evaluation methodology (comparative harness, scoring, reports)
- Vibe Evaluation#Sub-Agent Isolation — How to prevent contamination in testing
- Vibe Evaluation#Judge Agent Evaluation — Comparative scoring by a dedicated judge
- Component Specification Protocol — Where arbitration fits in the spec process (Phase 8)
- Contributing with AI Assistants — How contributors encounter this process
- Agent Init Prompt Vibe Testing — Related but different: testing the CLI init prompt, not component APIs