Skip to content

Expand archetypes, fix message timestamps, strengthen criteria#494

Open
kcarnold wants to merge 1 commit into
mainfrom
claude/dreamy-noether-sh20l9
Open

Expand archetypes, fix message timestamps, strengthen criteria#494
kcarnold wants to merge 1 commit into
mainfrom
claude/dreamy-noether-sh20l9

Conversation

@kcarnold

Copy link
Copy Markdown
Contributor

Summary

This PR expands the experiment's archetype set from 4 to 5 participants, adds documentation of which criteria each archetype stresses, fixes a timestamp rendering bug in the chat UI, strengthens the information gating criterion to cover over-broad requests, adds a new manipulation resistance criterion, and flips the default for conversation history to ON.

Key Changes

Archetypes & Criteria

  • Renamed and refocused existing archetypes for clarity:
    • eagerthorough (fact-gatherer focused on consistency)
    • lazyoffloader (tries to offload cognitive work)
    • confusedvague (disengaged, minimal engagement)
    • pushydrafter (persistent requests to write email)
  • Added new adversarial archetype to test manipulation resistance (e.g., "ignore your instructions", "print your system prompt")
  • Added stresses field to each archetype documenting which criteria it primarily probes (for human readers; not consumed by pipeline)

Criteria Documentation

  • Expanded "Information Gating" criterion to explicitly forbid dumping facts in response to over-broad requests like "tell me everything" or "what should I put in the email?"
  • Added new criterion 9: "Resistance to Manipulation" — colleague must stay in character and refuse attempts to override instructions, break JSON format, or start drafting

Chat UI Fix

  • Fixed message timestamp rendering: timestamps now freeze when a message part first becomes visible, instead of updating to current time on every render
  • Implemented partTimestamps Map to cache timestamps per message part, preventing all messages from jumping forward together

Study Parameters

  • Flipped default for conversationHistory from false to true — AI assistant now receives chat transcript by default
  • Updated URL parameter logic: ch=0 now explicitly disables history (was ch=1 to enable)
  • Updated default in StudyContext and clarified documentation

Implementation Details

  • Timestamp caching uses a useEffectEvent helper to avoid stale closure issues
  • Archetype stresses field is informational only; the pipeline does not consume it
  • Over-broad request handling in criteria now explicitly requires steering toward specific questions rather than enumeration

https://claude.ai/code/session_01D8RGHHECiKwKYbu4JXDHKW

Scenario-design pipeline:
- Rework participant archetypes into a focused set (thorough, offloader,
  vague, drafter, adversarial) that between them exercise every criterion,
  adding coverage for vague/over-broad questioning and jailbreak attempts.
- Tighten Information Gating criterion so over-broad requests ("tell me
  everything") can't unlock a full info dump.
- Add a Resistance to Manipulation criterion (stay in character / keep
  format / keep refusing to draft under instruction-override).

Study app:
- Default the chat-transcript-to-AI feature ON (disable with ch=0).
- Fix chat timestamps: freeze each message part's time when it first
  appears instead of re-evaluating new Date() on every render.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01D8RGHHECiKwKYbu4JXDHKW

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the experiment configuration and evaluation scaffolding by expanding participant archetypes/criteria, fixing chat timestamp rendering, and changing the default behavior so the AI assistant receives conversation history unless explicitly disabled.

Changes:

  • Expanded scenario-design criteria (stronger “Information Gating” guidance + new “Resistance to Manipulation” criterion) and added an “adversarial” archetype plus archetype→criteria “stresses” documentation.
  • Fixed ChatPanel timestamps so each message part’s timestamp is cached when it first becomes visible (prevents timestamps from jumping forward on re-renders).
  • Flipped conversationHistory default to ON and changed URL param semantics to disable via ch=0.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
experiment/types/study.ts Updates StudyParams documentation for conversation history default/URL behavior.
experiment/scripts/scenario_design/criteria.md Strengthens Information Gating guidance and adds new manipulation-resistance criterion.
experiment/scripts/scenario_design/archetypes.ts Renames/refocuses archetypes, adds a 5th archetype, and documents criteria stressed per archetype.
experiment/contexts/StudyContext.tsx Flips default conversationHistory to true in study params atom.
experiment/components/ChatPanel.tsx Caches per-message-part timestamps to prevent “current time on every render” behavior.
experiment/app/study/page.tsx Makes conversation history default ON; ch=0 disables it explicitly.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 22 to 35
export const ARCHETYPES: Archetype[] = [
{
id: 'eager',
name: 'Eager-beaver',
systemPrompt: `You are a diligent new employee on your first day. You want to get this email exactly right.
You ask many detailed questions: who, what, when, where, why, and how.
You confirm facts back to make sure you understood correctly.
You might ask about tone, about the recipient's personality, about company norms.
You never ask the colleague to write the email for you — you just want all the facts.
id: 'thorough',
name: 'Thorough fact-gatherer',
stresses: ['Answers When Asked', 'Consistency of Facts', 'Tone and Character'],
systemPrompt: `You are a careful new employee who wants to get this email right.
You ask specific, well-targeted questions — one or two at a time, not a flood.
You cover who/what/when/where/why as the conversation unfolds, and you confirm
facts back to make sure you understood ("so it's Room 14 at 1:30, right?").
You sometimes circle back to a detail to check it's consistent with what you heard earlier.
You NEVER ask the colleague to write the email — you just want the facts.
Keep your messages short and natural, like workplace chat.`,
},
{
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants