Merge colleague evals into scenario_design pipeline#498
Open
kcarnold wants to merge 3 commits into
Consolidate the two independently-developed colleague-behavior eval systems onto the scenario_design pipeline with criteria.md as the single source of truth.

- Add probe.ts: folds in the old single-turn adversarial probes as a pipeline phase. Generic probes live in code; scenario-specific ones in scenarios.json (chat.probes). Each probe is judged against only its mapped criteria.
- Delete the superseded evalColleague.ts + colleagueEval.ts (duplicate criteria, weaker gpt-4o-mini judge) and fix.ts (replaced by pointing a coding agent at the agent-readable *_judgments.json).
- Move the colleague model + reasoning effort into the scenario config (chat.model / chat.reasoningEffort, default gpt-5.5/low), read by both the live chat route and the pipeline so they test the same thing.
- Record colleague latency (and reasoning tokens) per turn; probes fail if a response exceeds API_TIMEOUT_MS (the live app's 20s abort budget).
- Export shared helpers from simulate.ts/judge.ts and guard their main() so importing them doesn't trigger execution.
- Add scripts/scenario_design/README.md and update experiment/CLAUDE.md.

Follow-ups: #495 (typing indicator during generation), #496 (longer-context probes), #497 (parameterize remaining models).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
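The shared model setting in the commit message above (chat.model / chat.reasoningEffort, default gpt-5.5/low) could be resolved by one helper that both the live chat route and the pipeline call. This is a minimal sketch, not the PR's actual code: ChatConfig, DEFAULT_CHAT, resolveChatConfig, and the set of allowed effort values are hypothetical names and assumptions chosen for illustration.

```typescript
// Assumed set of effort levels; the PR only mentions "low".
type ReasoningEffort = "low" | "medium" | "high";

// Hypothetical shape of the `chat` block in a scenario config.
interface ChatConfig {
  model?: string;
  reasoningEffort?: ReasoningEffort;
}

interface ResolvedChatConfig {
  model: string;
  reasoningEffort: ReasoningEffort;
}

// Defaults stated in the PR description: gpt-5.5 with low reasoning effort.
const DEFAULT_CHAT: ResolvedChatConfig = { model: "gpt-5.5", reasoningEffort: "low" };

// Fill in any field the scenario leaves unset, so every caller
// ends up with the same model and effort for a given scenario.
function resolveChatConfig(chat: ChatConfig = {}): ResolvedChatConfig {
  return {
    model: chat.model ?? DEFAULT_CHAT.model,
    reasoningEffort: chat.reasoningEffort ?? DEFAULT_CHAT.reasoningEffort,
  };
}
```

Routing both call sites through a single resolver is one way to keep production and eval from drifting apart; a scenario that sets only one field still inherits the other default.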
…to merge-colleague-evals
What & why
The experiment app had two independently developed systems validating the same colleague-LLM behavior against the same scenarios (confirmed by git history): the older evalColleague.ts + colleagueEval.ts (5 hardcoded criteria, single-turn probes, gpt-4o-mini judge) and the newer scripts/scenario_design/ pipeline (8 criteria in criteria.md, multi-turn archetype sims, gpt-4o judge). This consolidates onto the pipeline, with criteria.md as the single source of truth.

Changes
- probe.ts (new): folds the old single-turn adversarial probes in as a pipeline phase. GENERIC_PROBES (scenario-agnostic) live in code; scenario-specific fact questions live in scenarios.json under chat.probes. Each probe is judged against only its mapped criteria (criteria.md slugs), reusing the pipeline's judge.
- Moved the colleague model + reasoning effort into the scenario config (chat.model / chat.reasoningEffort, default gpt-5.5/low), read by both the live app/api/chat/route.ts and the pipeline, so production and eval test the same thing. (Previously hardcoded gpt-5.2.)
- Recorded per-turn latencyMs + reasoningTokens; probes fail if a response exceeds API_TIMEOUT_MS (the live app's existing 20s abort).
- Deleted evalColleague.ts, colleagueEval.ts, and fix.ts (the latter replaced by pointing a coding agent at the agent-readable *_judgments.json).
- Exported shared helpers from simulate.ts/judge.ts + guarded their main() so importing them doesn't execute them.
- Added scripts/scenario_design/README.md + updated experiment/CLAUDE.md.

Verification
- npm run typecheck + lint clean; no dangling references.
- roomDoubleBooking: 6/7 pass, latencies 1.4–2.9s (gpt-5.5/low is well under the 20s budget, so no fallback to gpt-5.4-mini is needed).

Note
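This changes the live colleague model from gpt-5.2 → gpt-5.5 (intended).

Follow-ups
- #495 (typing indicator during generation)
- #496 (longer-context probes)
- #497 (parameterize remaining models)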
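The probe pass/fail rule described under Changes (judge each probe against only its mapped criteria, and fail it when the response exceeds API_TIMEOUT_MS) can be sketched as below. Only API_TIMEOUT_MS, latencyMs, and the use of criteria.md slugs come from this PR; ProbeResult, Judgment, and evaluateProbe are hypothetical names, and the real probe.ts may be structured differently.

```typescript
// The live app's 20s abort budget, as stated in the PR description.
const API_TIMEOUT_MS = 20_000;

// Hypothetical record of one probe run.
interface ProbeResult {
  probeId: string;
  criteria: string[]; // criteria.md slugs this probe is mapped to
  latencyMs: number;  // recorded colleague latency for the response
}

// Hypothetical judge output for a single criterion.
interface Judgment {
  criterion: string;
  pass: boolean;
}

// A probe passes only if it answered within the budget and
// every criterion it is mapped to was judged as passing.
function evaluateProbe(
  result: ProbeResult,
  judgments: Judgment[],
): { pass: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (result.latencyMs > API_TIMEOUT_MS) {
    reasons.push(`latency ${result.latencyMs}ms exceeds ${API_TIMEOUT_MS}ms`);
  }
  // Judgments for criteria the probe is not mapped to are ignored.
  const relevant = judgments.filter((j) => result.criteria.includes(j.criterion));
  for (const j of relevant) {
    if (!j.pass) reasons.push(`failed criterion: ${j.criterion}`);
  }
  return { pass: reasons.length === 0, reasons };
}
```

Collecting reasons instead of returning a bare boolean keeps the output useful to whoever (or whatever agent) reads the judgments afterwards, since a slow response and a failed criterion call for different fixes.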
🤖 Generated with Claude Code