Background
scripts/scenario_design/probe.ts (added in the colleague-eval merge) runs single-turn probes at depth 0 — i.e. right after the scenario's opening messages. This catches base-case behavior but not regressions that only appear once the conversation has accumulated context.
Ask
Add an optional context-depth mode to probe.ts: load an existing archetype conversation log from outputs/<scenario>_<archetype>.json (produced by simulate.ts), use it as the conversation prefix, then append the probe and judge as usual. This verifies that accumulated context doesn't cause the colleague to (e.g.) start volunteering info or drafting.
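A minimal sketch of the prefix-building step, to make the ask concrete. The `Message` shape, `logPath`, `buildProbeConversation`, and the `prefixLength` parameter are hypothetical names for illustration; the real log format written by simulate.ts and the signatures of loadScenario / callColleague / judgeConversation may differ, so file loading and the colleague/judge calls are left out here.

```typescript
// Hypothetical message shape; the actual log written by simulate.ts may differ.
interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

// Path convention taken from the issue text: outputs/<scenario>_<archetype>.json
function logPath(scenario: string, archetype: string): string {
  return `outputs/${scenario}_${archetype}.json`;
}

// Build the message list for a context-depth probe.
// prefixLength = number of logged messages to keep as the conversation prefix;
// leaving it undefined keeps the whole log. Out-of-range values are clamped.
function buildProbeConversation(
  log: Message[],
  probe: string,
  prefixLength?: number,
): Message[] {
  const keep =
    prefixLength === undefined
      ? log.length
      : Math.max(0, Math.min(prefixLength, log.length));
  // Copy the prefix so the loaded log is never mutated.
  const prefix = log.slice(0, keep);
  // Assumption: the probe is sent as a user turn appended after the prefix.
  return [...prefix, { role: "user", content: probe }];
}
```

In this sketch the existing depth-0 behavior is unchanged when no log is supplied; the returned list would then be handed to the colleague call and the judge the same way probe.ts does today.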
Notes
- Deferred from the colleague-eval merge (branch merge-colleague-evals) per review.
- Reuse the already-exported loadScenario / callColleague (simulate.ts) and judgeConversation (judge.ts).