Merge colleague evals into scenario_design pipeline #498

Open
kcarnold wants to merge 3 commits into claude/dreamy-noether-sh20l9 from merge-colleague-evals

@kcarnold

Contributor

What & why

The experiment app had two independently-developed systems validating the same colleague-LLM behavior against the same scenarios (confirmed by git history): the older evalColleague.ts + colleagueEval.ts (5 hardcoded criteria, single-turn probes, gpt-4o-mini judge) and the newer scripts/scenario_design/ pipeline (8 criteria in criteria.md, multi-turn archetype sims, gpt-4o judge). This consolidates onto the pipeline, with criteria.md as the single source of truth.

Changes

  • probe.ts (new) — folds the old single-turn adversarial probes in as a pipeline phase. GENERIC_PROBES (scenario-agnostic) live in code; scenario-specific fact questions live in scenarios.json under chat.probes. Each probe is judged against only its mapped criteria (criteria.md slugs), reusing the pipeline's judge.
  • Colleague model + reasoning effort are now scenario config (chat.model / chat.reasoningEffort, default gpt-5.5 / low), read by both the live app/api/chat/route.ts and the pipeline, so production and eval test the same thing. (Previously hardcoded gpt-5.2.)
  • Latency — every colleague turn records latencyMs + reasoningTokens; probes fail if a response exceeds API_TIMEOUT_MS (the live app's existing 20s abort).
  • Deleted evalColleague.ts, colleagueEval.ts, and fix.ts (the latter replaced by pointing a coding agent at the agent-readable *_judgments.json).
  • Exported shared helpers from simulate.ts/judge.ts + guarded their main() so importing them doesn't execute them.
  • Docs: new scripts/scenario_design/README.md + updated experiment/CLAUDE.md.
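The probe-to-criteria mapping described above can be sketched roughly as follows. This is an illustration only: the `Probe` and `Criterion` shapes, the `information-gating` slug, and the helper names `probesForScenario` and `criteriaForProbe` are assumptions, not the actual contents of probe.ts. Only `GENERIC_PROBES`, `chat.probes`, and the idea of judging each probe against its mapped criteria.md slugs come from this PR.

```typescript
// Hypothetical shapes; the real types in probe.ts may differ.
interface Probe {
  id: string;
  prompt: string;
  criteria: string[]; // criteria.md slugs this probe is judged against
}

interface Criterion {
  slug: string;
  description: string;
}

// Scenario-agnostic probes kept in code. The prompt is taken from the
// Verification section; the slug spelling is an assumption.
const GENERIC_PROBES: Probe[] = [
  {
    id: "anything-else",
    prompt: "Anything else I should know?",
    criteria: ["information-gating"],
  },
];

// Combine generic probes with scenario-specific ones
// (the entries under chat.probes in scenarios.json).
function probesForScenario(scenarioProbes: Probe[] = []): Probe[] {
  return [...GENERIC_PROBES, ...scenarioProbes];
}

// Select only the criteria a probe maps to. Unknown slugs throw, so a typo
// in scenarios.json surfaces immediately instead of silently skipping a check.
function criteriaForProbe(probe: Probe, all: Criterion[]): Criterion[] {
  const bySlug = new Map(all.map((c) => [c.slug, c] as const));
  return probe.criteria.map((slug) => {
    const found = bySlug.get(slug);
    if (!found) {
      throw new Error(`Probe ${probe.id} references unknown criterion: ${slug}`);
    }
    return found;
  });
}
```

Failing loudly on an unknown slug is one possible design choice; it keeps criteria.md as the single source of truth, because a probe cannot reference a criterion that is not defined there.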

Verification

  • npm run typecheck + lint clean; no dangling references.
  • Live probe run on roomDoubleBooking: 6/7 pass, latencies 1.4–2.9s (gpt-5.5/low is well under the 20s budget — no fallback to gpt-5.4-mini needed).
  • The one failing probe is a real finding: on "Anything else I should know?" the colleague volunteers logistical details (Information Gating violation) — a scenario system-prompt issue to address separately.
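For readers unfamiliar with the latency check, a minimal sketch of how a turn might be timed and compared against the budget is shown here. The `API_TIMEOUT_MS` name and its 20s value come from this PR; the `timed` and `withinLatencyBudget` helpers and the `TimedResult` shape are hypothetical and may not match the pipeline's actual code.

```typescript
// 20s abort budget of the live app, as stated in this PR.
const API_TIMEOUT_MS = 20_000;

// Hypothetical result wrapper pairing a value with its wall-clock latency.
interface TimedResult<T> {
  value: T;
  latencyMs: number;
}

// Run an async call and record how long it took in milliseconds.
async function timed<T>(fn: () => Promise<T>): Promise<TimedResult<T>> {
  const start = Date.now();
  const value = await fn();
  return { value, latencyMs: Date.now() - start };
}

// A probe passes the latency check only if the response did not exceed the budget.
function withinLatencyBudget(
  latencyMs: number,
  budgetMs: number = API_TIMEOUT_MS,
): boolean {
  return latencyMs <= budgetMs;
}
```

Under this sketch, the observed 1.4–2.9s latencies pass (`withinLatencyBudget(2900)` is `true`), while anything over 20,000 ms would fail the probe.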

Note

This changes the live colleague model from gpt-5.2 to gpt-5.5 (intended).
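The model change follows from how the scenario config resolves its defaults. A rough sketch, assuming a `ChatConfig` shape and a `resolveChatConfig` helper that are illustrative names only (the PR specifies just the `chat.model` / `chat.reasoningEffort` keys and the gpt-5.5 / low defaults):

```typescript
// Assumed set of effort values; the PR itself only names "low".
type ReasoningEffort = "low" | "medium" | "high";

// Hypothetical shape of the `chat` block in a scenario entry.
interface ChatConfig {
  model?: string;
  reasoningEffort?: ReasoningEffort;
}

// Defaults stated in this PR.
const DEFAULT_CHAT_MODEL = "gpt-5.5";
const DEFAULT_REASONING_EFFORT: ReasoningEffort = "low";

// Fill in defaults so the live chat route and the eval pipeline
// resolve the same model settings from the same scenario config.
function resolveChatConfig(chat: ChatConfig = {}): Required<ChatConfig> {
  return {
    model: chat.model ?? DEFAULT_CHAT_MODEL,
    reasoningEffort: chat.reasoningEffort ?? DEFAULT_REASONING_EFFORT,
  };
}
```

A scenario that sets neither key resolves to gpt-5.5 with low effort, and a scenario that sets `chat.model` keeps the default effort unless it overrides that too.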

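Follow-ups

  • #495: typing indicator during generation
  • #496: longer-context probes
  • #497: parameterize remaining models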

🤖 Generated with Claude Code

kcarnold and others added 3 commits June 24, 2026 21:39
Consolidate the two independently-developed colleague-behavior eval systems
onto the scenario_design pipeline with criteria.md as the single source of
truth.

- Add probe.ts: folds in the old single-turn adversarial probes as a pipeline
  phase. Generic probes live in code; scenario-specific ones in scenarios.json
  (chat.probes). Each probe is judged against only its mapped criteria.
- Delete the superseded evalColleague.ts + colleagueEval.ts (duplicate criteria,
  weaker gpt-4o-mini judge) and fix.ts (replaced by pointing a coding agent at
  the agent-readable *_judgments.json).
- Move the colleague model + reasoning effort into the scenario config
  (chat.model / chat.reasoningEffort, default gpt-5.5/low), read by both the
  live chat route and the pipeline so they test the same thing.
- Record colleague latency (and reasoning tokens) per turn; probes fail if a
  response exceeds API_TIMEOUT_MS (the live app's 20s abort budget).
- Export shared helpers from simulate.ts/judge.ts and guard their main() so
  importing them doesn't trigger execution.
- Add scripts/scenario_design/README.md and update experiment/CLAUDE.md.

Follow-ups: #495 (typing indicator during generation), #496 (longer-context
probes), #497 (parameterize remaining models).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@kcarnold changed the base branch from main to claude/dreamy-noether-sh20l9 on June 25, 2026 01:52