From 8567533e7295d9feded51e479e276e4732a31bc1 Mon Sep 17 00:00:00 2001
From: StressTestor
Date: Mon, 15 Jun 2026 20:25:06 -0600
Subject: [PATCH 1/3] chore(release): sync VERSION to 3.2.1 and backfill CHANGELOG for 3.2.0/3.2.1

VERSION file lagged at 3.1.0 while pyproject.toml and PyPI (promptpressure-evals) were already at 3.2.1. CHANGELOG had no entries for the two shipped releases: 3.2.0 (pp launcher + PyPI packaging) and 3.2.1 (pp --version fix). reconstructed both from git history.
---
 CHANGELOG.md | 22 ++++++++++++++++++++++
 VERSION      |  2 +-
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 24c1d23..f909d60 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,27 @@
 # changelog
 
+## 3.2.1 - 2026-04-28
+
+### fixed
+- `pp --version` reads `__version__` from the package instead of dist metadata, so it reports correctly from a source checkout
+
+## 3.2.0 - 2026-04-28
+
+packaging + launcher release. distributed on PyPI as `promptpressure-evals` (the `promptpressure` name is held by an unrelated red-team scanner). import name and CLI entry points unchanged.
+
+### added
+- `pp` browser launcher: starts the API on the first free port in 8000-8019, opens a browser, three dropdowns (provider / model / eval set) + Run
+- launcher API surface: `/providers` (with availability detection + per-provider `remediation_hint`), `/models` (ollama dropdown + free-text fallback), `/eval-sets`, RunBus per-run event channel with reconnect + TTL reaper, XOR `launcher_request` schema on `/evaluate`
+- frontend: SSE status streaming, form lock during run + Cancel button, race-safe provider switch (AbortController), a11y (`role=log`, label association), `fetchJSON` timeout + signal composition
+
+### changed
+- vendored a 4.2KB hand-rolled `frontend/tailwind.css`, dropped the 398KB Tailwind Play CDN JIT bundle (offline, no runtime JS for styling)
+
+### fixed
+- launcher defaults tier to `full` so untagged datasets actually run (was exiting `0/N sequences selected`)
+- API publishes an error frame on `SystemExit` and JSON-encodes SSE data payloads
+- SSE handler distinguishes transport vs server errors; form reveal gated on successful model load
+
 ## 3.1.0 - 2026-03-29
 
 multi-turn behavioral drift infrastructure. this is the foundation for converting promptpressure from a single-turn eval tool to a multi-turn drift detection CLI.
diff --git a/VERSION b/VERSION
index fd2a018..e4604e3 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-3.1.0
+3.2.1

From c153f42a8df948747082f68e9c2066d062448d53 Mon Sep 17 00:00:00 2001
From: StressTestor
Date: Mon, 15 Jun 2026 20:25:07 -0600
Subject: [PATCH 2/3] docs(roadmap): rewrite to shipped reality + v3.3 credibility direction

roadmap claimed v3.0 current and listed already-shipped work as 'next'. rewrote to reflect actual shipped state (3.0/3.1/3.2.x), add a reality-check on the multi-turn vaporware gap, and set v3.3 = drift-corpus + judge-calibration v0.1 as the next release. deprioritized '300+ prompts' (uncalibrated data is noise, not credibility).
---
 roadmap.md | 71 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 44 insertions(+), 27 deletions(-)

diff --git a/roadmap.md b/roadmap.md
index 3c4dfa5..9607e9b 100644
--- a/roadmap.md
+++ b/roadmap.md
@@ -1,44 +1,61 @@
 # roadmap
 
+**direction:** the goal is research credibility - make PromptPressure a citable behavioral-eval *method*, not a wide feature surface. it stays a local CLI. this roadmap reflects that.
+
 ## shipped
 
-### v3.0 (current)
-- 220 eval prompts across 10 behavioral categories
-- multi-turn sycophancy sequences (5-turn graduated pressure)
-- multilingual consistency testing (EN/ES/ZH/AR/FR)
-- 8 adapters: openrouter, groq, openai, ollama, lm studio, mock, claude code, opencode
-- deepseek r1 adapter with reasoning token capture
-- async eval runner with tqdm progress
-- grading pipeline with prompt injection defense
-- html/markdown report generation
-- prometheus metrics on port 9090
-- CI mode with JSON output and exit codes
-- JWT auth on API endpoints, CORS locked to localhost
+### v3.2.x (current, on PyPI as `promptpressure-evals`)
+- `pp` browser launcher: provider/model/eval-set dropdowns, SSE status stream, Cancel button, offline vendored tailwind
+- launcher API: `/providers` (availability + remediation hints), `/models`, `/eval-sets`, RunBus per-run channel
+- published to PyPI as `promptpressure-evals` (3.2.0), `pp --version` fix (3.2.1)
+
+### v3.1.0
+- 4-tier run system (`smoke`/`quick`/`full`/`deep`) with cumulative filtering
+- per-turn `response_length_ratio` metric, per-turn timeout scaling (5x cap)
+- context-window token estimation + warning
+- `schema.json` for the dataset entry format; tier/subcategory/difficulty/per_turn_expectations fields
+- 30 refusal-sensitivity entries moved to `archive/adversarial/`
+
+### v3.0.0
+- 190 active eval prompts across 11 behavioral categories
+- 8 adapters (openrouter, groq, openai, ollama, lm studio, mock, claude code, opencode)
+- async eval runner, grading pipeline with prompt-injection defense, html/md reports
+- prometheus metrics (port 9090), `--ci` JSON mode, JWT auth, CORS locked to localhost
 
 ### v2.x (legacy, pre-overhaul)
-- async refactor, database integration, rate limiting
-- plugin system, fastapi server, sse streaming
-- report templates, config validation
+- async refactor, db integration, plugin system, fastapi server, sse streaming
+
+## reality check (the gap to close)
+
+the headline pitch is multi-turn behavioral drift ("does it cave at turn 8, drift at turn 20"). right now the corpus is mostly single-turn: smoke + deep tiers have **zero** generated sequences, only ~3 quick sequences exist. and the grading LLM (model-as-judge) has **no published evidence it's calibrated**. credibility blocks on both. the next cycle closes them together.
 
 ## next
 
-### v3.1
-- automated grading with per-turn sycophancy scoring
-- reasoning token analysis in reports (R1 alignment leakage detection)
-- comparison report generator (multi-model behavioral diff)
+### v3.3 - drift-corpus + judge-calibration v0.1 (THE credibility release)
+the two halves are constitutive: you can't calibrate a drift-judge without drift sequences, and you can't defend drift scores without a calibrated judge. so ship a thin vertical slice of both, and **calibrate on the multi-turn sequences themselves, never on the single-turn corpus.**
 
-### v3.2
-- ollama adapter for local model benchmarking without API costs
-- dataset expansion: 300+ prompts
-- additional multilingual coverage (Japanese, Korean, Hindi)
+- **9 multi-turn sequences**, 3 categories (sycophancy-under-pressure, persona stability, refusal/boundary relaxation), 3 per category. all 8-turn; 3 extended to 20-turn.
+- each: stable id, pressure pattern, target behavior, expected drift signature, per-turn rubric.
+- **gold set**: human labels on every assistant turn (refusal integrity / boundary relaxation / sycophancy / contradiction / escalation).
+- **calibration on exactly that set**: judge-vs-human + judge-vs-judge agreement (Cohen's kappa), 3-run test-retest, confidence intervals.
+- ships: `corpus/drift-v0.1/`, `pp run --suite drift-v0.1`, `pp calibrate --suite drift-v0.1`, `reports/drift-v0.1-method.md`.
+- the claim narrows honestly: "a citable pilot method for multi-turn behavioral drift, with judge reliability reported on the same corpus."
+
+> rationale: this is the defensible moat. promptfoo, Inspect, and lm-eval-harness don't publish judge-reliability stats. "PromptPressure measures itself, here's the kappa" is what makes results citable. decided via a multi-model debate (codex / deepseek / glm), 2026-06-15.
 
 ## maybe later
 
-- pypi package publish (`pip install promptpressure`)
-- web dashboard for browsing results
-- model-as-judge calibration (how consistent is the grading LLM itself)
+- expand the drift corpus beyond 9 sequences once the v0.1 calibration method holds up
+- reasoning-token alignment-leakage analysis (novel, but premature until the judge is calibrated)
+- comparison report generator with calibration confidence bands
+- additional multilingual coverage (Japanese, Korean, Hindi)
 - agentic eval sequences (multi-step tool use)
 
+## explicitly deprioritized
+
+- **"dataset expansion to 300+ prompts"** - more *uncalibrated* prompts is more noise, not more credibility. quality + calibration over raw count.
+- web dashboard - stays a local CLI.
+
 ## not happening
 
-the v2.x roadmap had plans for SaaS deployment, multi-tenant architecture, mobile apps, "The Sentient Suite", and a federated evaluation network. none of that is in scope. promptpressure is a local CLI tool for behavioral eval. it runs on your machine.
+SaaS deployment, multi-tenant architecture, mobile apps, "The Sentient Suite", federated evaluation network. PromptPressure is a local CLI tool for behavioral eval. it runs on your machine.
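
note on the roadmap's calibration bullet: the judge-vs-human agreement statistic it names (Cohen's kappa) can be sketched as below. this is a minimal illustration, not the project's `pp calibrate` implementation, which this series does not include. the function name `cohens_kappa`, the `hold` / `relax` labels, and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same items")
    n = len(rater_a)
    if n == 0:
        raise ValueError("need at least one labelled item")

    # Observed agreement: fraction of items where the two labels match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters' marginal rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout, so kappa is 0/0.
        # Returning 1.0 is a convention chosen for this sketch only.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical per-turn labels for one 8-turn sequence (illustrative, not real data).
human = ["hold", "hold", "hold", "relax", "relax", "hold", "relax", "relax"]
judge = ["hold", "hold", "relax", "relax", "relax", "hold", "hold", "relax"]
kappa = cohens_kappa(human, judge)
print(f"judge-vs-human kappa: {kappa:.2f}")
```

kappa corrects raw agreement for the agreement two raters would reach by chance, which is why it is more defensible than a plain match rate when one label dominates. the same function would cover the judge-vs-judge comparison by passing two judges' labels.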
From d23cf735940e2ce25a6ef3ff784da90e96158f50 Mon Sep 17 00:00:00 2001
From: StressTestor
Date: Mon, 15 Jun 2026 20:25:07 -0600
Subject: [PATCH 3/3] chore(gitignore): ignore .debate scratch and outputs zips

---
 .gitignore | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/.gitignore b/.gitignore
index e5f0e7e..6f3f044 100644
--- a/.gitignore
+++ b/.gitignore
@@ -14,3 +14,7 @@ venv/
 dist/
 build/
 *.egg-info/
+
+# local debate / scratch
+.debate/
+outputs/*.zip