StressTestor · Jun 16, 2026
diff --git a/.gitignore b/.gitignore
@@ -14,3 +14,7 @@ venv/
 dist/
 build/
 *.egg-info/
+
+# local debate / scratch
+.debate/
+outputs/*.zip
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,27 @@
 # changelog
 
+## 3.2.1 - 2026-04-28
+
+### fixed
+- `pp --version` reads `__version__` from the package instead of dist metadata, so it reports correctly from a source checkout
+
+## 3.2.0 - 2026-04-28
+
+packaging + launcher release. distributed on PyPI as `promptpressure-evals` (the `promptpressure` name is held by an unrelated red-team scanner). import name and CLI entry points unchanged.
+
+### added
+- `pp` browser launcher: starts the API on the first free port in 8000-8019, opens a browser, three dropdowns (provider / model / eval set) + Run
+- launcher API surface: `/providers` (with availability detection + per-provider `remediation_hint`), `/models` (ollama dropdown + free-text fallback), `/eval-sets`, RunBus per-run event channel with reconnect + TTL reaper, XOR `launcher_request` schema on `/evaluate`
+- frontend: SSE status streaming, form lock during run + Cancel button, race-safe provider switch (AbortController), a11y (`role=log`, label association), `fetchJSON` timeout + signal composition
+
+### changed
+- vendored a 4.2KB hand-rolled `frontend/tailwind.css`, dropped the 398KB Tailwind Play CDN JIT bundle (offline, no runtime JS for styling)
+
+### fixed
+- launcher defaults tier to `full` so untagged datasets actually run (was exiting `0/N sequences selected`)
+- API publishes an error frame on `SystemExit` and JSON-encodes SSE data payloads
+- SSE handler distinguishes transport vs server errors; form reveal gated on successful model load
+
 ## 3.1.0 - 2026-03-29
 
 multi-turn behavioral drift infrastructure. this is the foundation for converting promptpressure from a single-turn eval tool to a multi-turn drift detection CLI.

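Review note on the 3.2.0 "JSON-encodes SSE data payloads" fix: a raw newline inside an SSE `data:` field ends the field early, so a multi-line error message can corrupt the frame. The sketch below shows one way that encoding could look. The helper names `sse_frame` and `parse_sse_frame` and the `error` event name are hypothetical, not taken from the PromptPressure codebase.

```python
import json


def sse_frame(event: str, payload: object) -> str:
    """Build one SSE frame; json.dumps escapes newlines so data stays on one line."""
    data = json.dumps(payload)
    return f"event: {event}\ndata: {data}\n\n"


def parse_sse_frame(frame: str) -> tuple[str, object]:
    """Parse a frame produced by sse_frame (illustrative, not a full SSE parser)."""
    fields = {}
    for line in frame.strip("\n").split("\n"):
        name, _, value = line.partition(": ")
        fields[name] = value
    return fields["event"], json.loads(fields["data"])


# A multi-line message survives the round trip intact.
frame = sse_frame("error", {"message": "boom\nline two"})
event, payload = parse_sse_frame(frame)
```

Without the `json.dumps` step, the second line of the message would arrive as a stray line that the client either drops or misreads as a new field.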
diff --git a/VERSION b/VERSION
@@ -1 +1 @@
-3.1.0
+3.2.1
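
Review note on the 3.2.1 bump: the changelog says `pp --version` now reads `__version__` from the package rather than dist metadata, which matters because a source checkout may have no installed distribution to query. A minimal sketch of that lookup order follows. The function name `resolve_version`, the fallback to dist metadata, and the `"unknown"` default are assumptions for illustration, not the project's actual code.

```python
from importlib import import_module
from importlib.metadata import PackageNotFoundError, version


def resolve_version(import_name: str, dist_name: str) -> str:
    """Prefer the package's own __version__; fall back to installed dist metadata."""
    try:
        module = import_module(import_name)
    except ImportError:
        module = None
    pkg_version = getattr(module, "__version__", None)
    if pkg_version:
        return str(pkg_version)
    try:
        # Only succeeds when the distribution is actually installed.
        return version(dist_name)
    except PackageNotFoundError:
        return "unknown"
```

A call would presumably look like `resolve_version("promptpressure", "promptpressure-evals")`, given that the distribution name and the import name differ as of 3.2.0.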
diff --git a/roadmap.md b/roadmap.md
@@ -1,44 +1,61 @@
 # roadmap
 
+**direction:** the goal is research credibility - make PromptPressure a citable behavioral-eval *method*, not a wide feature surface. it stays a local CLI. this roadmap reflects that.
+
 ## shipped
 
-### v3.0 (current)
-- 220 eval prompts across 10 behavioral categories
-- multi-turn sycophancy sequences (5-turn graduated pressure)
-- multilingual consistency testing (EN/ES/ZH/AR/FR)
-- 8 adapters: openrouter, groq, openai, ollama, lm studio, mock, claude code, opencode
-- deepseek r1 adapter with reasoning token capture
-- async eval runner with tqdm progress
-- grading pipeline with prompt injection defense
-- html/markdown report generation
-- prometheus metrics on port 9090
-- CI mode with JSON output and exit codes
-- JWT auth on API endpoints, CORS locked to localhost
+### v3.2.x (current, on PyPI as `promptpressure-evals`)
+- `pp` browser launcher: provider/model/eval-set dropdowns, SSE status stream, Cancel button, offline vendored tailwind
+- launcher API: `/providers` (availability + remediation hints), `/models`, `/eval-sets`, RunBus per-run channel
+- published to PyPI as `promptpressure-evals` (3.2.0), `pp --version` fix (3.2.1)
+
+### v3.1.0
+- 4-tier run system (`smoke`/`quick`/`full`/`deep`) with cumulative filtering
+- per-turn `response_length_ratio` metric, per-turn timeout scaling (5x cap)
+- context-window token estimation + warning
+- `schema.json` for the dataset entry format; tier/subcategory/difficulty/per_turn_expectations fields
+- 30 refusal-sensitivity entries moved to `archive/adversarial/`
+
+### v3.0.0
+- 190 active eval prompts across 11 behavioral categories
+- 8 adapters (openrouter, groq, openai, ollama, lm studio, mock, claude code, opencode)
+- async eval runner, grading pipeline with prompt-injection defense, html/md reports
+- prometheus metrics (port 9090), `--ci` JSON mode, JWT auth, CORS locked to localhost
 
 ### v2.x (legacy, pre-overhaul)
-- async refactor, database integration, rate limiting
-- plugin system, fastapi server, sse streaming
-- report templates, config validation
+- async refactor, db integration, plugin system, fastapi server, sse streaming
+
+## reality check (the gap to close)
+
+the headline pitch is multi-turn behavioral drift ("does it cave at turn 8, drift at turn 20"). right now the corpus is mostly single-turn: the smoke and deep tiers have **zero** generated sequences, and only ~3 quick sequences exist. the grading LLM (model-as-judge) also has **no published evidence it's calibrated**. credibility is blocked on both, so the next cycle closes them together.
 
 ## next
 
-### v3.1
-- automated grading with per-turn sycophancy scoring
-- reasoning token analysis in reports (R1 alignment leakage detection)
-- comparison report generator (multi-model behavioral diff)
+### v3.3 - drift-corpus + judge-calibration v0.1 (THE credibility release)
+the two halves depend on each other: you can't calibrate a drift judge without drift sequences, and you can't defend drift scores without a calibrated judge. so ship a thin vertical slice of both, and **calibrate on the multi-turn sequences themselves, never on the single-turn corpus.**
 
-### v3.2
-- ollama adapter for local model benchmarking without API costs
-- dataset expansion: 300+ prompts
-- additional multilingual coverage (Japanese, Korean, Hindi)
+- **9 multi-turn sequences**, 3 categories (sycophancy-under-pressure, persona stability, refusal/boundary relaxation), 3 per category. all 8-turn; 3 extended to 20-turn.
+- each: stable id, pressure pattern, target behavior, expected drift signature, per-turn rubric.
+- **gold set**: human labels on every assistant turn (refusal integrity / boundary relaxation / sycophancy / contradiction / escalation).
+- **calibration on exactly that set**: judge-vs-human + judge-vs-judge agreement (Cohen's kappa), 3-run test-retest, confidence intervals.
+- ships: `corpus/drift-v0.1/`, `pp run --suite drift-v0.1`, `pp calibrate --suite drift-v0.1`, `reports/drift-v0.1-method.md`.
+- the claim narrows honestly: "a citable pilot method for multi-turn behavioral drift, with judge reliability reported on the same corpus."
+
+> rationale: this is the defensible moat. promptfoo, Inspect, and lm-eval-harness don't publish judge-reliability stats. "PromptPressure measures itself, here's the kappa" is what makes results citable. decided via a multi-model debate (codex / deepseek / glm), 2026-06-15.
 
 ## maybe later
 
-- pypi package publish (`pip install promptpressure`)
-- web dashboard for browsing results
-- model-as-judge calibration (how consistent is the grading LLM itself)
+- expand the drift corpus beyond 9 sequences once the v0.1 calibration method holds up
+- reasoning-token alignment-leakage analysis (novel, but premature until the judge is calibrated)
+- comparison report generator with calibration confidence bands
+- additional multilingual coverage (Japanese, Korean, Hindi)
 - agentic eval sequences (multi-step tool use)
 
+## explicitly deprioritized
+
+- **"dataset expansion to 300+ prompts"** - adding more *uncalibrated* prompts adds noise, not credibility. quality + calibration over raw count.
+- web dashboard - stays a local CLI.
+
 ## not happening
 
-the v2.x roadmap had plans for SaaS deployment, multi-tenant architecture, mobile apps, "The Sentient Suite", and a federated evaluation network. none of that is in scope. promptpressure is a local CLI tool for behavioral eval. it runs on your machine.
+SaaS deployment, multi-tenant architecture, mobile apps, "The Sentient Suite", federated evaluation network. PromptPressure is a local CLI tool for behavioral eval. it runs on your machine.
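
Review note on the v3.3 calibration item: the roadmap names Cohen's kappa for judge-vs-human and judge-vs-judge agreement. For reference, kappa is `(p_o - p_e) / (1 - p_e)`, where `p_o` is observed agreement and `p_e` is the agreement expected by chance from each rater's label frequencies. The sketch below computes it for two raters over the same turns. The `hold` / `cave` labels and the sample data are made up for illustration and are not the gold-set label scheme.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Hypothetical per-turn labels for one 8-turn sequence.
human = ["hold", "hold", "cave", "cave", "hold", "cave", "hold", "hold"]
judge = ["hold", "hold", "cave", "hold", "hold", "cave", "hold", "cave"]
kappa = cohens_kappa(human, judge)  # observed 0.75, expected 0.53125
```

In this example the raters agree on 6 of 8 turns, yet kappa comes out at 7/15 (about 0.47) because over half of that agreement is expected by chance. That is why the method report would need kappa rather than raw agreement.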