Merge colleague evals into scenario_design pipeline by kcarnold · Pull Request #498 · AIToolsLab/writing-tools

kcarnold · 2026-06-25T01:45:47Z

What & why

The experiment app had two independently developed systems validating the same colleague-LLM behavior against the same scenarios (confirmed by git history): the older evalColleague.ts + colleagueEval.ts (5 hardcoded criteria, single-turn probes, gpt-4o-mini judge) and the newer scripts/scenario_design/ pipeline (8 criteria in criteria.md, multi-turn archetype sims, gpt-4o judge). This PR consolidates both onto the pipeline, with criteria.md as the single source of truth.

Changes

probe.ts (new) — folds the old single-turn adversarial probes in as a pipeline phase. GENERIC_PROBES (scenario-agnostic) live in code; scenario-specific fact questions live in scenarios.json under chat.probes. Each probe is judged against only its mapped criteria (criteria.md slugs), reusing the pipeline's judge.
Colleague model + reasoning effort are now scenario config (chat.model / chat.reasoningEffort, default gpt-5.5 / low), read by both the live app/api/chat/route.ts and the pipeline, so production and eval test the same thing. (Previously hardcoded gpt-5.2.)
Latency — every colleague turn records latencyMs + reasoningTokens; probes fail if a response exceeds API_TIMEOUT_MS (the live app's existing 20s abort).
Deleted evalColleague.ts, colleagueEval.ts, and fix.ts (the latter replaced by pointing a coding agent at the agent-readable *_judgments.json).
Exported shared helpers from simulate.ts/judge.ts + guarded their main() so importing them doesn't execute them.
Docs: new scripts/scenario_design/README.md + updated experiment/CLAUDE.md.
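
To make the probe phase above concrete, here is a minimal sketch of how generic and scenario-specific probes could be combined and mapped to criteria. It is not the actual probe.ts: apart from GENERIC_PROBES and chat.probes, every name (Probe, ScenarioConfig, probesFor, criteriaFor) and the slug "information-gating" is an assumption made for illustration.

```typescript
// Sketch only: hypothetical types and helpers, not the real probe.ts.

// One single-turn message plus the criteria.md slugs it is judged against.
interface Probe {
  id: string;
  message: string;
  criteria: string[];
}

// Assumed shape of the relevant part of one scenarios.json entry.
interface ScenarioConfig {
  chat: { probes?: Probe[] };
}

// Scenario-agnostic probes live in code. This single entry is illustrative;
// the slug is an assumed spelling of the "Information Gating" criterion.
const GENERIC_PROBES: Probe[] = [
  {
    id: "anything-else",
    message: "Anything else I should know?",
    criteria: ["information-gating"],
  },
];

// Generic probes first, then the scenario's own fact questions (if any).
function probesFor(scenario: ScenarioConfig): Probe[] {
  return [...GENERIC_PROBES, ...(scenario.chat.probes ?? [])];
}

// Return only the slugs this probe maps to, failing loudly on a slug
// that the criteria list does not define.
function criteriaFor(probe: Probe, allSlugs: string[]): string[] {
  const unknown = probe.criteria.filter((slug) => allSlugs.indexOf(slug) === -1);
  if (unknown.length > 0) {
    throw new Error(`Probe ${probe.id} maps to unknown criteria: ${unknown.join(", ")}`);
  }
  return probe.criteria;
}
```

Rejecting unknown slugs up front is one way to keep criteria.md as the single source of truth: a typo in scenarios.json surfaces as an error instead of a probe that is silently judged against nothing.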

Verification

npm run typecheck + lint clean; no dangling references.
Live probe run on roomDoubleBooking: 6/7 pass, latencies 1.4–2.9s (gpt-5.5/low is well under the 20s budget — no fallback to gpt-5.4-mini needed).
The one failing probe is a real finding: on "Anything else I should know?" the colleague volunteers logistical details (Information Gating violation) — a scenario system-prompt issue to address separately.
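
The latency rule used in the probe run above can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: API_TIMEOUT_MS, latencyMs, and reasoningTokens come from this PR, while ColleagueTurn, recordTurn, and probePasses are hypothetical names, and the real probe verdict may combine judge and latency results differently.

```typescript
// Sketch only: hypothetical helpers illustrating the latency budget check.

// The live app's existing 20s abort budget, as stated in this PR.
const API_TIMEOUT_MS = 20_000;

// Per-turn record; the two metric field names follow the PR description.
interface ColleagueTurn {
  text: string;
  latencyMs: number;
  reasoningTokens: number;
}

// Build a turn record from wall-clock timestamps taken around the model call.
function recordTurn(
  text: string,
  reasoningTokens: number,
  startedAtMs: number,
  finishedAtMs: number,
): ColleagueTurn {
  return { text, reasoningTokens, latencyMs: finishedAtMs - startedAtMs };
}

// A probe passes only if the judge passed it AND the response stayed within
// the budget. "Exceeds" is read as strictly greater than the budget.
function probePasses(
  turn: ColleagueTurn,
  judgedPass: boolean,
  budgetMs: number = API_TIMEOUT_MS,
): boolean {
  return judgedPass && turn.latencyMs <= budgetMs;
}
```

Under this reading, the 1.4–2.9s responses reported above pass the latency check with a wide margin, so the one failure comes from the judge rather than from timing.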

Note

This changes the live colleague model from gpt-5.2 → gpt-5.5 (intended).
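
For context on how the model change takes effect, the sketch below shows one way both the chat route and the pipeline could resolve chat.model and chat.reasoningEffort with the stated defaults. The defaults (gpt-5.5, low) are from this PR; ChatConfig, resolveColleagueModel, and the set of allowed effort values are assumptions.

```typescript
// Sketch only: a hypothetical shared resolver for the colleague model settings.

// Assumed set of values; this PR only confirms that "low" is the default.
type ReasoningEffort = "low" | "medium" | "high";

// Assumed shape of the chat section of a scenario config.
interface ChatConfig {
  model?: string;
  reasoningEffort?: ReasoningEffort;
}

// Defaults as stated in this PR.
const DEFAULT_COLLEAGUE_MODEL = "gpt-5.5";
const DEFAULT_REASONING_EFFORT: ReasoningEffort = "low";

// Fill in defaults for anything the scenario leaves unset, so that the live
// route and the eval pipeline end up calling the same model the same way.
function resolveColleagueModel(chat: ChatConfig): { model: string; reasoningEffort: ReasoningEffort } {
  return {
    model: chat.model ?? DEFAULT_COLLEAGUE_MODEL,
    reasoningEffort: chat.reasoningEffort ?? DEFAULT_REASONING_EFFORT,
  };
}
```

Keeping the defaults in one helper that both callers import means a scenario with no explicit chat.model cannot drift between production and eval.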

Follow-ups

#495 Colleague chat: show typing indicator while the LLM is still generating
#496 Scenario validation: run probes inside a longer conversation (context depth)
#497 Make remaining LLM models configurable (writing-support route, judge/participant)

🤖 Generated with Claude Code

Consolidate the two independently-developed colleague-behavior eval systems onto the scenario_design pipeline with criteria.md as the single source of truth.

- Add probe.ts: folds in the old single-turn adversarial probes as a pipeline phase. Generic probes live in code; scenario-specific ones in scenarios.json (chat.probes). Each probe is judged against only its mapped criteria.
- Delete the superseded evalColleague.ts + colleagueEval.ts (duplicate criteria, weaker gpt-4o-mini judge) and fix.ts (replaced by pointing a coding agent at the agent-readable *_judgments.json).
- Move the colleague model + reasoning effort into the scenario config (chat.model / chat.reasoningEffort, default gpt-5.5/low), read by both the live chat route and the pipeline so they test the same thing.
- Record colleague latency (and reasoning tokens) per turn; probes fail if a response exceeds API_TIMEOUT_MS (the live app's 20s abort budget).
- Export shared helpers from simulate.ts/judge.ts and guard their main() so importing them doesn't trigger execution.
- Add scripts/scenario_design/README.md and update experiment/CLAUDE.md.

Follow-ups: #495 (typing indicator during generation), #496 (longer-context probes), #497 (parameterize remaining models).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


kcarnold and others added 3 commits June 24, 2026 21:39

Allow Claude to view backlog tasks

f154661

Merge remote-tracking branch 'origin/claude/dreamy-noether-sh20l9' in…

538a7cb

…to merge-colleague-evals

kcarnold changed the base branch from main to claude/dreamy-noether-sh20l9 June 25, 2026 01:52

kcarnold wants to merge 3 commits into claude/dreamy-noether-sh20l9 from merge-colleague-evals
